TLDR Data 2026-05-14
DuckDB Goes Remote 🦆, When Lakehouses Guess, Netflix Tames Data Governance 🎬
Data Projects: Managing Data Assets at Netflix Scale (6 minute read)
Netflix introduced Data Projects to replace brittle ACLs and human-owned workflow identities across millions of tables and thousands of jobs. Projects group tables, workflows, secrets, and assets under durable team-owned app identities, with scoped roles and tokens to reduce permission churn.
When 36,000 Tiny Files Break Your Spark Pipeline: A Deep Dive into S3 DNS Exhaustion and the Small File Problem (9 minute read)
Thousands of tiny Parquet files on S3 can break Spark reads with UnknownHostException even when networking is healthy: the flood of S3 LIST/GET calls and per-file metadata handling on the driver and tasks overwhelms DNS resolution. Spark partition tuning can help stabilize reads, but the real fix is compaction and table formats like Delta Lake or Iceberg.
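To see why compaction helps, here is a minimal planning sketch (the function name and greedy strategy are illustrative, not from the article): packing 36,000 one-megabyte files into ~128 MB output groups cuts the object count, and thus the LIST/GET and DNS pressure, by two orders of magnitude. Real pipelines would then rewrite each group as one Parquet file via Spark or a table format's rewrite/OPTIMIZE action.

```python
def plan_compaction(file_sizes, target_bytes=128 * 1024 * 1024):
    """Greedily pack small files into groups of roughly target_bytes each."""
    groups, current, current_size = [], [], 0
    for size in file_sizes:
        # Flush the current group before it would exceed the target size.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 36,000 files of 1 MiB each collapse into a few hundred ~128 MiB outputs.
groups = plan_compaction([1024 * 1024] * 36000)
```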
Why your AI agent has amnesia and why forgetting is the fix (16 minute read)
Enterprise AI agents fail in long workflows because they reset, lose context, and rely on bloated prompts or flat vector search. Microsoft's memory architecture uses consolidation, forgetting, and delayed maturation to keep high-value events, reaching 97.2% retention precision and stabilizing around 400 to 500 memories.
Migrating Data Ingestion Systems at Meta Scale (8 minute read)
Meta migrated its massive data ingestion system from legacy customer-owned pipelines to a simpler self-managed service using a phased Shadow → Reverse Shadow → Cleanup lifecycle, row count and checksum checks, automated promotion tooling, custom debugging infrastructure, and rollback mechanisms to prevent bad CDC data propagation.
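The row-count-and-checksum idea can be sketched in a few lines (this is a simplified illustration, not Meta's implementation): fingerprint each pipeline's output with a count plus an order-independent hash, and promote the shadow only when both match. Note the XOR makes the check order-independent but blind to duplicate-row pairs; a production check would be stricter.

```python
import hashlib

def table_fingerprint(rows):
    """Row count plus an order-independent XOR of per-row hashes."""
    acc = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        acc ^= int.from_bytes(digest[:8], "big")
    return len(rows), acc

def shadow_matches(legacy_rows, shadow_rows):
    """True when both pipelines produced the same rows, in any order."""
    return table_fingerprint(legacy_rows) == table_fingerprint(shadow_rows)
```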
We need to talk about dbt (5 minute read)
dbt's growth has created tension between its practitioner-led roots and enterprise ambitions. dbt must better protect community trust, improve dbt Core, strengthen integrations, fix developer ergonomics, and make dbt Cloud feel like a real IDE. The risk is not adoption, but alienating the users who made dbt valuable.
April 2026 PDC State of Data Modeling Survey Results Are In! (9 minute read)
A 334-response April 2026 pulse survey shows data modeling pain is overwhelmingly organizational, not tooling: 28.1% want training, 24.6% clearer requirements, 21.6% more time, 21.0% dedicated ownership, and only 4.8% better tools. Modeling is often owned by whoever builds pipelines (42.5%), while only 19.2% have a dedicated modeler or architect, and 68.3% refactor only occasionally or rarely. Teams with enforced standards are about 5x more likely to say their models hold up.
Lakehouse statistics and why query engines get lost (6 minute read)
Lakehouse query engines often struggle because the statistical metadata they need to plan queries, skip irrelevant data, size joins, and handle skew is optional, inconsistent, or missing across formats like Iceberg, Delta Lake, and Parquet. Without reliable stats, engines are forced to guess, leading to bad query plans, wasted reads, higher costs, memory issues, and slow or failed queries.
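The core mechanism is min/max pruning, sketched below under assumed per-file stats (the dict layout is hypothetical): with stats, the planner proves a file cannot match the predicate and skips it; with stats missing, it has no choice but to read the file, which is exactly the wasted-read failure mode the article describes.

```python
def prune_files(files, lo, hi):
    """Keep files whose [min, max] stats overlap the predicate range [lo, hi].

    Files with missing stats must be read anyway, since the planner
    cannot prove they are irrelevant.
    """
    kept = []
    for f in files:
        stats = f.get("stats")
        if stats is None or (stats["min"] <= hi and stats["max"] >= lo):
            kept.append(f)
    return kept

files = [
    {"path": "a.parquet", "stats": {"min": 0, "max": 9}},
    {"path": "b.parquet", "stats": {"min": 10, "max": 19}},
    {"path": "c.parquet"},  # stats never written: always scanned
]
kept = prune_files(files, 12, 15)  # a.parquet is provably irrelevant
```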
Can Kafka Queues Make Consumers Faster? Part 2: Head-Of-Line Blocking (4 minute read)
Kafka Queues (Share Groups) shine when consumer processing involves delays or external I/O that causes Head-Of-Line Blocking. By allowing more consumer instances than partitions, share groups enable linear scaling of throughput (tested up to 8x with 32 instances) with no noticeable per-instance overhead, making them very effective for I/O-bound workloads.
Quack: The DuckDB Client-Server Protocol (12 minute read)
Quack is a new client-server protocol that lets separate DuckDB instances communicate over HTTP instead of only running in-process. It uses a request/response model with custom application/duckdb serialization, default token-based auth, localhost binding, and no SSL by default for local use, while supporting remote connections through standard HTTP infrastructure.
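Based only on the facts in the summary (HTTP transport, token-based auth, an application/duckdb content type, localhost with no SSL by default), a client request might be shaped roughly like the sketch below. The endpoint path, port, and header layout are guesses for illustration; the actual Quack wire format uses its own serialization, not raw SQL text.

```python
from urllib.request import Request

def build_query_request(sql, token, host="localhost", port=8000):
    """Assemble a Quack-style HTTP request (shape guessed for illustration)."""
    return Request(
        url=f"http://{host}:{port}/query",       # hypothetical endpoint path
        data=sql.encode("utf-8"),                # real protocol: custom serialization
        headers={
            "Authorization": f"Bearer {token}",  # hypothetical token-auth header
            "Content-Type": "application/duckdb",
        },
        method="POST",
    )
```

Because it is plain HTTP, such requests can traverse standard proxies and load balancers, which is what makes the remote-connection story work.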
Strong views on PostgreSQL VIEWs (19 minute read)
Views are just stored rewrite rules (macros) that get expanded at query time. They behave like tables for simple cases, but create hidden complexity through nested spirals, fragile dependencies on attribute numbers, painful schema changes, and limited writability, often leading to the classic advice: "use them, but don't treat them like tables."
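The "macro, not table" point is easy to demonstrate. The article is about PostgreSQL; SQLite is used below only because it ships with Python and views behave the same way for this point: a view is a stored query re-run at execution time, not a snapshot of data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.execute("CREATE VIEW big AS SELECT x FROM t WHERE x > 1")

# Rows inserted *after* the view was defined still appear, because the
# view's query is expanded against the current table each time it runs.
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
rows = [r[0] for r in con.execute("SELECT x FROM big ORDER BY x")]
```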
Agentic search models (3 minute read)
Agentic search models are emerging to orchestrate the full retrieval workflow, replacing today's brittle stack of embeddings, rerankers, query classifiers, and BM25 with thinner backend primitives. Unlike frontier LLMs that handle the "80% case," models trained specifically for search can encode domain-specific intent and the "last 20%" of retrieval nuances, improving relevance in narrow contexts like e-commerce or job search. Early examples such as SID-1 and Waldo emphasize smaller size and lower latency.
Stop Starting Data Projects (9 minute read)
Many data projects fail not because of technical issues, but because engineers jump straight into building without properly understanding the stakeholders' real needs and processes. Instead, start by asking the stakeholder to walk through their current workflow, create a one-sentence Definition of Done, ship an ugly MVP, and iterate on it to turn vague requests into shipped, adopted work while dramatically reducing wasted effort.
Curated deep dives, tools and trends in big data, data science and data engineering
Join 400,000 readers for one daily email