TLDR Data 2025-12-22
The LLM Data Engineer ⚙️, The Stack Consolidates 🧱, Metrics Over Dashboards 🧠
Maximizing Throughput with Apache Hudi NBCC: Stop Retrying, Start Scaling (9 minute read)
Apache Hudi's traditional Optimistic Concurrency Control (OCC) struggles with high-concurrency writes on Merge-on-Read tables. Overlapping writes cause frequent conflicts, leading to retries, aborted work, and reduced throughput when mixing streaming and batch jobs. Non-Blocking Concurrency Control (NBCC), introduced in Hudi 1.0, removes these write-time conflicts: concurrent writers append to separate log files, commits are ordered by completion time using a TrueTime-like mechanism, and writes succeed without retries.
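The idea of ordering by completion time can be illustrated with a toy model. This is a minimal sketch, not Hudi's API: `ToyFileGroup`, `LogFile`, the integer clock, and the last-completed-wins merge are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Optional


@dataclass
class LogFile:
    writer: str
    records: Dict[str, str] = field(default_factory=dict)
    completion_time: Optional[int] = None  # set only when the write commits


class ToyFileGroup:
    """Toy model of one file group (hypothetical; not Hudi's actual API)."""

    def __init__(self) -> None:
        # Stand-in for a monotonic, TrueTime-like time source (assumption).
        self._clock = count(1)
        self.log_files: List[LogFile] = []

    def begin_write(self, writer: str) -> LogFile:
        # Each writer gets its own log file, so writers never block each other.
        log = LogFile(writer=writer)
        self.log_files.append(log)
        return log

    def commit(self, log: LogFile) -> None:
        # Ordering is decided when the write finishes, not when it starts.
        log.completion_time = next(self._clock)

    def snapshot(self) -> Dict[str, str]:
        # Readers merge committed logs in completion-time order;
        # the last completed write wins for a given key (assumption).
        committed = [l for l in self.log_files if l.completion_time is not None]
        merged: Dict[str, str] = {}
        for log in sorted(committed, key=lambda l: l.completion_time):
            merged.update(log.records)
        return merged
```

In this sketch a streaming writer that starts first but commits last still lands after an overlapping batch writer, and neither write is aborted or retried.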
Inside the Feature Store Powering Real-Time AI in Dropbox Dash (7 minute read)
Dropbox Dash's AI-driven search and ranking system is powered by a custom feature store architecture tailored to handle tens of thousands of parallel feature lookups per query and meet strict sub-100ms latency requirements. Integrating Feast for orchestration, Spark for computation, and an in-house DynamoDB-compatible solution for ~20ms online serving, Dropbox overcame Python concurrency bottlenecks by rewriting serving paths in Go, achieving p95 latencies of 25–35ms. Intelligent hybrid ingestion strategies optimize data freshness and I/O. This approach balances flexibility and performance, scales efficiently, and delivers consistent ranking relevance.
Data Testing like it's not 1997 (12 minute read)
Treat data quality like software quality: define what “good” means, design layered tests, and automate them across the pipeline. This cuts rework and incidents by shifting quality left while using production observability for what can't be caught earlier. Durable data quality comes from disciplined processes and ownership, not a single tool.
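Layered tests of this kind might look roughly like the sketch below. The column names (`order_id`, `amount`), the three layers, and every function name are hypothetical and not taken from the article.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Check = Callable[[List[Row]], List[str]]  # a check returns failure messages


def schema_check(rows: List[Row]) -> List[str]:
    # Layer 1: structure. Every row must carry the required columns.
    required = {"order_id", "amount"}
    failures = []
    for i, row in enumerate(rows):
        missing = required - set(row)
        if missing:
            failures.append(f"row {i}: missing {sorted(missing)}")
    return failures


def value_check(rows: List[Row]) -> List[str]:
    # Layer 2: row-level rules. Amounts, when present, must not be negative.
    return [
        f"row {i}: negative amount"
        for i, row in enumerate(rows)
        if row.get("amount") is not None and row["amount"] < 0
    ]


def volume_check(rows: List[Row], min_rows: int = 1) -> List[str]:
    # Layer 3: dataset-level expectation. An empty batch is suspicious.
    if len(rows) >= min_rows:
        return []
    return [f"expected >= {min_rows} rows, got {len(rows)}"]


def run_checks(rows: List[Row], checks: List[Check]) -> List[str]:
    # Run every layer and collect all failures rather than stopping early.
    failures: List[str] = []
    for check in checks:
        failures.extend(check(rows))
    return failures
```

A pipeline step could call `run_checks(batch, [schema_check, value_check, volume_check])` and fail the run when the returned list is non-empty, which is one way to automate the checks early in the pipeline.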
Alert Fatigue Is Killing Your Data Quality Strategy. Here's How to Fix It (5 minute read)
Alert fatigue in data observability occurs when teams are overwhelmed by excessive, often false-positive alerts from overly rigid or unprioritized monitoring, leading to ignored notifications and stalled data quality improvements. To combat this, adopt strategies like machine learning-based monitors that learn normal data patterns to reduce noise, focus monitoring on critical tables and data products, align alert ownership with domain teams, and implement intelligent prioritization and routing based on alert type, location, and downstream impact.
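Prioritization and routing by alert type, location, and downstream impact could be sketched as follows. The weights, thresholds, table names, and channel names are invented for illustration and would need tuning for any real setup.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Alert:
    table: str                 # location, written as "<domain>.<table>"
    alert_type: str            # e.g. "schema_change", "freshness", "volume"
    downstream_consumers: int  # rough proxy for downstream impact


# All values below are hypothetical examples.
TYPE_WEIGHT = {"schema_change": 3, "freshness": 2, "volume": 1}
CRITICAL_TABLES = {"finance.revenue"}
OWNERS = {"finance": "#finance-data", "marketing": "#marketing-data"}
DEFAULT_CHANNEL = "#data-platform"


def priority(alert: Alert) -> str:
    # Combine alert type, table criticality, and (capped) downstream impact.
    score = TYPE_WEIGHT.get(alert.alert_type, 1)
    if alert.table in CRITICAL_TABLES:
        score += 3
    score += min(alert.downstream_consumers, 5)
    if score >= 8:
        return "page"
    if score >= 4:
        return "notify"
    return "digest"


def route(alert: Alert) -> Tuple[str, str]:
    # Send the alert to the owning domain team, falling back to a default.
    domain = alert.table.split(".", 1)[0]
    return OWNERS.get(domain, DEFAULT_CHANNEL), priority(alert)
```

Low-scoring alerts fall into a digest instead of paging anyone, which is one simple way to cut noise without dropping signals entirely.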
10 Catalogs, 3 ETLs, 2 Postgres, and a Partridge in a Pear Tree (14 minute read)
2025 reshaped the modern data stack through a wave of major acquisitions, most notably Fivetran merging with dbt, Confluent joining IBM, and both Snowflake and Databricks buying Postgres companies, signaling consolidation around core infrastructure. Independence and openness are becoming strategic positioning rather than defaults, with vendors racing to control more of the stack while claiming flexibility. Expect tighter integration, cultural friction in open source communities, and increasing pressure to choose platforms deliberately rather than assuming today's “modern” stack will stay modular tomorrow.
Announcing Support for GROUP BY, SUM, and Other Aggregation Queries in R2 SQL (8 minute read)
Cloudflare has added full SQL aggregation, enabling distributed analytics directly over data stored in R2 Data Catalog. Leveraging scatter-gather and shuffling techniques, aggregations efficiently scale out computation, minimize coordinator bottlenecks, and support high-cardinality queries with deterministic hash partitioning. This release brings R2 SQL closer to a full-fledged serverless OLAP solution, empowering data teams to analyze data at scale without additional infrastructure.
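The general scatter-gather pattern with a hash-partitioned shuffle can be sketched in a few lines. This is an in-process illustration of the technique for a GROUP BY ... SUM query, not Cloudflare's implementation; all function names are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, float]  # (group key, value)


def partition_for(key: str, num_partitions: int) -> int:
    # crc32 is stable across processes, unlike Python's built-in str hash,
    # so every worker sends a given key to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partial_sum(rows: Iterable[Row]) -> Dict[str, float]:
    # Pre-aggregate locally so only one subtotal per key leaves a worker.
    acc: Dict[str, float] = defaultdict(float)
    for key, value in rows:
        acc[key] += value
    return dict(acc)


def shuffle(partials: List[Dict[str, float]], num_partitions: int) -> List[List[Row]]:
    # Route each subtotal to the partition that owns its key.
    buckets: List[List[Row]] = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, subtotal in partial.items():
            buckets[partition_for(key, num_partitions)].append((key, subtotal))
    return buckets


def group_by_sum(shards: List[List[Row]], num_partitions: int = 4) -> Dict[str, float]:
    partials = [partial_sum(shard) for shard in shards]  # scatter
    buckets = shuffle(partials, num_partitions)          # shuffle
    result: Dict[str, float] = {}
    for bucket in buckets:                               # merge per partition
        result.update(partial_sum(bucket))
    return result                                        # gather
```

Because each key is owned by exactly one partition, the final merge is a plain union of disjoint results, so the coordinator does no per-key aggregation itself.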
Da2a: The Future of Data Platforms is Agentic, Distributed, and Collaborative (6 minute read)
Traditional centralized data platforms rely on data engineers for ETL and queries, which creates bottlenecks and slows decision-making for business users. DA2A proposes agentic, distributed, and collaborative data platforms in which specialized AI agents autonomously manage domain-specific data, communicate via an Agent-to-Agent protocol, and collaborate through an orchestrator to answer complex business questions without a monolithic architecture.
Why Your Next BI Renewal Should Be Your Last (16 minute read)
Legacy BI tools are being squeezed from all sides: AI-native analytics can answer questions directly, data platforms are moving up the stack, and purpose-built tools are out-innovating dashboards, while incumbents respond mainly with price hikes and weak AI bolt-ons. For data engineers and leaders, the real value is shifting from dashboards to governed, machine-readable metrics and semantic layers that can power many interfaces, including agents and custom apps. Stop renewing BI on autopilot and instead invest in agentic coding, flexible metric foundations, and tools that let teams build faster and deliver insights directly into workflows.
Snowflake Software Update Caused 13-hour Outage Across 10 Regions (4 minute read)
A breaking schema change in Snowflake's control plane caused a 13-hour outage across 10 regions, stopping queries and data ingestion for affected customers. The incident showed that multi-region setups do not protect against control plane logic errors, because shared global metadata can fail everywhere at once.
What Does a Database for SSDs Look Like? (6 minute read)
Databases like Postgres and MySQL were built for old hard-drive limits, where slow disk access required heavy logging, large pages, and complex single-node recovery. With fast SSDs, cloud replication, and reliable networks, many of these designs are no longer needed. A modern database should push durability and availability to distributed systems, optimize for small, fast transfers, use versioning for reads, and simplify recovery by dropping legacy single-node mechanisms.
The Rise of the LLM Data Engineer (60 minute read)
The real shift is not LLMs replacing data engineers, but data teams becoming LLM-native by redesigning workflows so humans guide, validate, and own outcomes while models automate routine ELT and transformation work. For data leaders, the value moves from writing SQL and pipelines to platform engineering, governance, and operationalizing AI safely at scale, with LLMs accelerating delivery rather than acting autonomously. Teams that learn to structure context, rules, and human-in-the-loop workflows will move faster and cheaper, while those chasing one-shot automation or fearing displacement will fall behind.