TLDR Data 2026-05-04
Zero-Downtime at Stripe, Trimming ML Feature Bloat, Less Repetitive Data QA
Data Mesh at Grab Part II: The Foundational Tools behind Certification (10 minute read)
Grab operationalizes data mesh certification with an event-driven metadata graph built on DataHub, Kafka-backed metadata events, DataHub Actions for continuous certification, Temporal for validation workflows, and Airflow/Lighthouse pipeline-completion events to trigger quality checks. The key idea: trust is computed from live ownership, lineage, contracts, SLAs, and test health, not manually assigned, and contract rules link to concrete health endpoints.
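A toy Python sketch of the certification idea (the signal names, thresholds, and event shape here are illustrative assumptions, not Grab's actual rules): certification status is recomputed from live metadata signals whenever a pipeline-completion event arrives, rather than being set by hand.

```python
from dataclasses import dataclass

@dataclass
class DatasetHealth:
    """Live signals pulled from the metadata graph (illustrative fields)."""
    has_owner: bool
    lineage_resolved: bool
    contract_defined: bool
    sla_met: bool
    tests_passing_ratio: float  # 0.0 - 1.0

def certify(health: DatasetHealth) -> str:
    """Compute certification from live signals instead of assigning it manually."""
    if not (health.has_owner and health.lineage_resolved and health.contract_defined):
        return "UNCERTIFIED"
    if health.sla_met and health.tests_passing_ratio >= 0.95:
        return "CERTIFIED"
    return "AT_RISK"

def on_pipeline_completed(event: dict, lookup_health) -> str:
    """Event handler: a pipeline-completion event triggers re-certification."""
    health = lookup_health(event["dataset"])
    return certify(health)

# Example: a completion event for a hypothetical dataset.
health_store = {"orders_daily": DatasetHealth(True, True, True, True, 0.98)}
print(on_pipeline_completed({"dataset": "orders_daily"}, health_store.get))
```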
Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer (14 minute read)
Pinterest built Feature Trimmer to dynamically remove low-value or redundant features from large-scale ML training and inference requests. It combines offline feature importance analysis with online trimming logic, substantially reducing network bandwidth usage and cost and improving client-side latency while maintaining model performance.
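A minimal sketch of the general pattern, offline importance scores driving an online trim step; the feature names and threshold are assumptions, not Pinterest's actual configuration.

```python
# Offline step: feature importance scores, e.g. exported from model analysis.
feature_importance = {
    "user_age_bucket": 0.41,
    "pin_embedding": 0.35,
    "stale_counter_v1": 0.002,   # low-value feature
    "duplicate_ctr_7d": 0.001,   # redundant feature
}

# Online step: trim low-value features from a request before it goes over the wire.
IMPORTANCE_THRESHOLD = 0.01  # assumed cutoff

def trim_features(request_features: dict, importance: dict, threshold: float) -> dict:
    """Drop features whose offline importance score is below the threshold."""
    return {
        name: value
        for name, value in request_features.items()
        if importance.get(name, 0.0) >= threshold
    }

request = {
    "user_age_bucket": 3,
    "pin_embedding": [0.1, 0.2],
    "stale_counter_v1": 7,
    "duplicate_ctr_7d": 0.4,
}
print(trim_features(request, feature_importance, IMPORTANCE_THRESHOLD))
# -> only the two high-importance features remain in the payload
```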
How we rebuilt search ranking at Faire with deep learning (11 minute read)
Faire rebuilt its search ranking stack from XGBoost to deep learning to better optimize competing goals like relevance, freshness, brand discovery, and cross-surface consistency. The migration required reworking data pipelines, observability, and production serving, including custom Docker-based infrastructure, shared-memory embeddings, and CPU sandboxing to cut startup latency from 20-30 minutes to a few minutes. The new stack delivered measurable gains, including a ~2% order volume boost on Product Search.
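One of the startup-latency tricks mentioned, shared-memory embeddings, can be illustrated with Python's standard multiprocessing.shared_memory module; this is a generic sketch of the idea, not Faire's implementation.

```python
import numpy as np
from multiprocessing import shared_memory

# Loader process: load the embedding table once into a shared-memory block.
embeddings = np.random.rand(1_000, 64).astype(np.float32)  # placeholder table
shm = shared_memory.SharedMemory(create=True, size=embeddings.nbytes)
shared_view = np.ndarray(embeddings.shape, dtype=embeddings.dtype, buffer=shm.buf)
shared_view[:] = embeddings  # copy once

# Serving worker: attach to the same block by name instead of reloading from disk,
# so each worker starts without paying the embedding-load cost again.
# (Shown in-process for brevity; in practice this runs in a separate worker process.)
worker_shm = shared_memory.SharedMemory(name=shm.name)
worker_view = np.ndarray(embeddings.shape, dtype=np.float32, buffer=worker_shm.buf)
print(worker_view[0][:4])  # reads the shared table, no duplicate copy in memory

worker_shm.close()
shm.close()
shm.unlink()
```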
Stripe's DocDB: How Zero-Downtime Data Movement Powers Trillion-Dollar Payment Processing (44 minute video)
Stripe runs DocDB on open-source MongoDB to support 5 million QPS, 2,000+ shards, and 99.9995% reliability while processing $1.4T in payments in 2024. Its zero-downtime data movement platform enables horizontal sharding, version upgrades, and single-tenant/multi-tenant migrations without interrupting traffic using point-in-time snapshots, CDC-based replication, and version-gated cutovers.
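A highly simplified, in-memory sketch of the movement pattern described (point-in-time snapshot, CDC-style catch-up, then cutover); the Shard class and its fields are stand-ins, not Stripe's internals, and the version gate is reduced to a comment.

```python
from dataclasses import dataclass, field

@dataclass
class Shard:
    """Toy stand-in for a document shard: committed docs plus a change log."""
    docs: dict = field(default_factory=dict)
    changelog: list = field(default_factory=list)  # (seq, key, value) tuples

    def write(self, key, value):
        self.docs[key] = value
        self.changelog.append((len(self.changelog), key, value))

def migrate_shard(source: Shard, target: Shard) -> None:
    """Snapshot, CDC catch-up, then cutover once the target is fully caught up."""
    # 1. Point-in-time snapshot: copy current docs, remember the change position.
    snapshot_pos = len(source.changelog)
    target.docs = dict(source.docs)

    # 2. CDC-style catch-up: replay changes made after the snapshot until none remain.
    #    In the real system, live writes keep arriving during this phase, which is
    #    why replication continues until the target is nearly current.
    while snapshot_pos < len(source.changelog):
        _, key, value = source.changelog[snapshot_pos]
        target.docs[key] = value
        snapshot_pos += 1

    # 3. Cutover (version-gated in the real system): new traffic now goes to the target.

src, dst = Shard(), Shard()
src.write("pi_123", {"amount": 500})
migrate_shard(src, dst)
print(dst.docs)
```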
How We Built an AI Second Brain for 60K Knowledge Workers (8 minute read)
Meta built an internal AI Second Brain to help its knowledge workers quickly find, synthesize, and reason over vast amounts of internal company information and documents. The system combines retrieval-augmented generation (RAG), advanced search, and agentic capabilities, with careful attention to privacy, accuracy, and enterprise-grade controls.
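A bare-bones retrieval-augmented generation loop, to show the shape of the RAG piece; keyword retrieval and a placeholder prompt builder stand in for the real search stack and model, and none of the names come from the post.

```python
DOCUMENTS = {
    "doc-1": "The quarterly planning process kicks off in the first week of March.",
    "doc-2": "Expense reports must be filed within 30 days of the purchase date.",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy retriever: rank docs by keyword overlap (a real system would use vector search)."""
    q_terms = set(query.lower().split())
    scored = sorted(
        DOCUMENTS.items(),
        key=lambda item: len(q_terms & set(item[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

def answer(query: str) -> str:
    """RAG: ground the (placeholder) generation step in retrieved internal context."""
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return prompt  # in a real system this prompt is sent to an LLM

print(answer("When does quarterly planning start?"))
```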
We automated data validation - Here's how we did it (12 minute read)
AI is becoming useful for analytics engineering not by replacing human judgment, but by removing the repetitive audit work around validation. The pattern that works best is an agent-assisted, evidence-heavy workflow in which the AI runs checks, investigates changes, and shows its work, while humans still decide what is acceptable.
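A small sketch of what "evidence-heavy" can mean in practice: the check returns not just pass/fail but the numbers a human reviewer needs to make the call. The metric and thresholds are illustrative, not from the article.

```python
from dataclasses import dataclass, field

@dataclass
class ValidationResult:
    check: str
    passed: bool
    evidence: dict = field(default_factory=dict)  # what the reviewer actually looks at

def row_count_drift_check(before: int, after: int, max_drift: float = 0.05) -> ValidationResult:
    """Flag large row-count changes, but always attach the numbers behind the verdict."""
    drift = abs(after - before) / max(before, 1)
    return ValidationResult(
        check="row_count_drift",
        passed=drift <= max_drift,
        evidence={"rows_before": before, "rows_after": after, "drift_pct": round(drift * 100, 2)},
    )

# The agent runs the check and surfaces the evidence; a human decides if 7.4% is acceptable.
result = row_count_drift_check(before=120_000, after=128_900)
print(result)
```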
Five Worlds of Data Engineering (10 minute read)
Data engineering advice often fails because it's written for one of five very different operating models: startup-style analytics teams, legacy enterprise environments, outcome-critical product/data systems, regulated businesses, or platform/data-mesh organizations. Each has different priorities (speed, stability, consequence, auditability, or adoption), and practices that are "best" in one can be dangerous in another. Classify your environment before applying guidance so that architecture, governance, and delivery practices match the actual constraints.
Your model scores great on evals. But they were built for English. Does that performance hold in Arabic? (Sponsor)
Datanomy (GitHub Repo)
Datanomy is a terminal tool for inspecting Parquet files. It shows schemas, metadata, data, statistics, and internal structures in an interactive view.
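For context on what that kind of inspection surfaces, a short pyarrow sketch (pyarrow, not Datanomy itself; the file path is a placeholder) printing the same classes of information: schema, row-group metadata, and per-column statistics.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")  # placeholder path

print(pf.schema_arrow)   # logical schema
print(pf.metadata)       # file-level metadata: row groups, total rows, format version

# Per-row-group, per-column physical details and statistics (min/max, null count).
rg = pf.metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.compression, col.statistics)
```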
What Held Up at 3 AM: One Engineer's RAG Case Study (17 minute read)
Most RAG systems fail in production because teams hard-code a vector DB, embedding model, and chunking strategy without observability or repeatable evals. Weave CLI addresses this by unifying 11 vector databases, 5 embedding providers, and swappable agents behind a single config-driven interface, with OpenTelemetry and Opik tracing baked in from day one.
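The config-driven idea in general form (hypothetical keys and registries, not Weave's actual config schema or API): the pipeline resolves which vector DB and embedding provider to use from configuration, so swapping either is a config change rather than a code change.

```python
# Hypothetical config illustrating the swappable-backend pattern (not Weave's schema).
config = {
    "vector_db": "qdrant",
    "embedder": "openai",
    "chunking": {"strategy": "recursive", "size": 512},
}

# Registries map config names to constructors; adding a backend means adding an entry.
VECTOR_DBS = {"qdrant": lambda: "QdrantClientStub()", "pgvector": lambda: "PgVectorStub()"}
EMBEDDERS = {"openai": lambda: "OpenAIEmbedderStub()", "cohere": lambda: "CohereEmbedderStub()"}

def build_pipeline(cfg: dict) -> dict:
    """Resolve backends from config so evals can compare them without code changes."""
    return {
        "store": VECTOR_DBS[cfg["vector_db"]](),
        "embedder": EMBEDDERS[cfg["embedder"]](),
        "chunking": cfg["chunking"],
    }

print(build_pipeline(config))
```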
Handling Schema Issues in Polars (6 minute read)
Polars has strong built-in support for schema evolution, covering changes like new or missing columns, type drift, and breaking changes. Depending on the data format, use parameters such as missing_columns="insert", schema_mode="merge", ScanCastOptions, and diagonal_relaxed concat so pipelines don't break when upstream schemas change.
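A minimal sketch of the relaxed-concat case, assuming a recent Polars version that supports how="diagonal_relaxed"; the frames are made up.

```python
import polars as pl

# Two frames whose upstream schema drifted: the newer one gained a column and
# changed "amount" from integer to float.
old = pl.DataFrame({"id": [1, 2], "amount": [10, 20]})
new = pl.DataFrame({"id": [3], "amount": [30.5], "currency": ["USD"]})

# diagonal_relaxed aligns columns by name, fills missing ones with nulls, and
# upcasts conflicting dtypes to a common supertype instead of raising.
combined = pl.concat([old, new], how="diagonal_relaxed")
print(combined)
```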
Bottling the River: Apache Fluss on EKS (6 minute read)
Apache Fluss is an "indexable Kafka" that combines horizontally scalable streaming ingestion with columnar storage, primary-key tables, CDC, and optional tiering to S3 or lakehouse formats like Iceberg and Paimon. In production on EKS, integrating it with Flink means working around missing connector JARs, S3 credential/delegation-token problems, and extra dependencies. Fluss can significantly simplify stateful streaming and lookup workloads, but 0.9-era production use still needs careful operational tuning.
Effective KV Compression with TurboQuant (4 minute read)
TurboQuant is a quantization and compression algorithm for key-value (KV) caches in large language models and vector search systems. It first uses PolarQuant to map vectors into polar coordinates, then applies QJL (Quantized Johnson-Lindenstrauss), a minimal 1-bit correction that removes hidden biases, enabling compression down to ~3 bits per value with virtually no loss in accuracy.
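For a sense of scale only: the memory arithmetic of going from 16-bit values to ~3 bits per value, shown with a plain uniform quantizer. This is a generic baseline for illustration, not the TurboQuant algorithm; a naive 3-bit quantizer like this one loses noticeably more precision than the method described.

```python
import numpy as np

def uniform_quantize(x: np.ndarray, bits: int):
    """Generic b-bit uniform quantizer (baseline illustration, not TurboQuant)."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    q = np.round((x - lo) / (hi - lo) * levels).astype(np.uint8)
    dequantized = q / levels * (hi - lo) + lo
    return q, dequantized

kv_block = np.random.randn(1024, 128).astype(np.float16)  # placeholder KV cache block
q, approx = uniform_quantize(kv_block.astype(np.float32), bits=3)

original_bits = kv_block.size * 16
compressed_bits = kv_block.size * 3
print(f"compression ratio: {original_bits / compressed_bits:.1f}x")          # ~5.3x
print(f"mean abs error of naive 3-bit quantizer: {np.abs(kv_block - approx).mean():.3f}")
```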
Introducing Neo4j Agent Skills (3 minute read)
Neo4j has released a first wave of Agent Skills to keep coding agents current with Cypher 25 and recent GQL-aligned syntax, including SHORTEST 3, REPEATABLE ELEMENTS, quantified path patterns, and path projections.
Does ELT vs. ETL Even Still Matter? (6 minute read)
Cloud data platforms like Snowflake, BigQuery, Redshift, and Databricks have made ELT the default because it is simpler, faster to iterate on, and lets teams use scalable warehouse compute for transformations.
Curated deep dives, tools, and trends in big data, data science, and data engineering