TLDR Data 2026-04-09
Netflix’s Time-Series Caching 🗄️, Airflow 3.2 Released 🚀, Meta’s Pipeline Context 🗺️
Stop Answering the Same Question Twice: Interval-Aware Caching for Druid at Netflix Scale (10 minute read)
Netflix built a caching layer in front of Apache Druid so it no longer recomputes the same time-series queries. The layer intercepts queries at the Druid Router, parses the query structure, and stores results in fine-grained time buckets in a Cassandra-backed cache. For overlapping windows, it serves cached data for settled intervals and fetches only the missing recent tail from Druid. Exponential TTLs and gap-aware merging balance freshness against cache hit rate.
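To make the idea concrete, here is a minimal sketch of interval-aware caching, not Netflix's implementation. The bucket size, the TTL curve, the in-memory dict standing in for the Cassandra-backed cache, and the names `ttl_for_age`, `contiguous_ranges`, `query`, and `fetch` are all assumptions for illustration. How the real system treats gaps may differ.

```python
from typing import Callable, Dict, List, Tuple

BUCKET = 60  # seconds per cache bucket (assumed granularity)

# cache maps bucket_start -> (value, expires_at); a dict stands in for the real store
Cache = Dict[int, Tuple[object, int]]


def ttl_for_age(age_s: int, base: int = 5, cap: int = 3600) -> int:
    """Exponential TTL: the older (more settled) a bucket is, the longer it is kept."""
    return min(cap, base * 2 ** min(age_s // BUCKET, 16))


def contiguous_ranges(buckets: List[int]) -> List[Tuple[int, int]]:
    """Group sorted bucket starts into [start, end) ranges so each gap is one backend call."""
    ranges: List[Tuple[int, int]] = []
    for b in buckets:
        if ranges and ranges[-1][1] == b:
            ranges[-1] = (ranges[-1][0], b + BUCKET)
        else:
            ranges.append((b, b + BUCKET))
    return ranges


def query(start: int, end: int, now: int, cache: Cache,
          fetch: Callable[[int, int], Dict[int, object]]) -> Dict[int, object]:
    """Return {bucket_start: value} for [start, end), fetching only uncached buckets."""
    result: Dict[int, object] = {}
    missing: List[int] = []
    for b in range(start - start % BUCKET, end, BUCKET):
        hit = cache.get(b)
        if hit is not None and hit[1] > now:
            result[b] = hit[0]          # settled bucket served from cache
        else:
            missing.append(b)           # expired or never seen
    for lo, hi in contiguous_ranges(missing):
        for b, value in fetch(lo, hi).items():
            cache[b] = (value, now + ttl_for_age(max(0, now - b)))
            result[b] = value
    return dict(sorted(result.items()))
```

With this shape, a dashboard that re-asks for "the last 10 minutes" every 30 seconds mostly hits cached buckets and sends only the freshest slice to the backend.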
How Meta Used AI to Map Tribal Knowledge in Large-Scale Data Pipelines (8 minute read)
Meta built a precompute engine using a swarm of over 50 specialized AI agents to map and document "tribal knowledge" across their massive data pipelines. The system generates concise, high-quality context files that capture non-obvious patterns, module purposes, dependencies, failure modes, and undocumented conventions, following a “compass, not encyclopedia” principle.
Proxy-Pointer RAG: Achieving Vectorless Accuracy at Vector RAG Scale and Cost (23 minute read)
Proxy-Pointer RAG is motivated by the PageIndex critique that retrieval in real enterprise documents is usually a structure navigation problem, not just a semantic-similarity problem: the right answer often depends on finding the right section, table, or path in a hierarchy rather than the most similar chunk. It brings that insight into a scalable vector pipeline by adding structural proxies such as document trees, ancestry paths, and pointer-like cues, aiming to close the accuracy gap between flat vector RAG and more reasoning-heavy vectorless approaches.
Semantic Layer vs. Text-to-SQL: 2026 Benchmark Update (11 minute read)
dbt Labs compares two approaches for letting AI/LLMs query data: raw Text-to-SQL (LLM directly generates SQL against tables) versus querying through the dbt Semantic Layer (which provides centrally defined, governed metrics and business logic). Even with the most advanced LLMs, the dbt Semantic Layer delivers higher accuracy, consistency, and governance by giving the model clean, pre-defined business metrics.
Is Data Visualization dead? (4 minute read)
AI hasn't killed data visualization, but it has commoditized the craft, removing much of the effort, creativity, and satisfaction that once made it enjoyable. As a result, the field has shifted from specialized roles to broader, AI-enabled generalist work, leaving dataviz more as a hobby than a core profession.
SQL Superpowers: Your Streaming Delta Lake Pipeline Has Been Quietly Falling Apart (5 minute read)
High-throughput Delta Lake streams can silently degrade as millions of tiny files accumulate, even while the pipeline stays green. Query latency can jump tenfold, and storage costs can rise 40% or more because Spark and cloud storage spend most of their time on file metadata, not data. The fix is operational: schedule OPTIMIZE on recent partitions, run VACUUM to delete the tombstoned files that compaction leaves behind, and monitor transaction log growth and file sizes. Auto Compaction and Optimized Writes help, but don't replace scheduled compaction at extreme scale.
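A scheduled maintenance job along these lines could be sketched as follows. The table name `events`, the partition column `event_date`, the three-day window, and the helper `maintenance_sql` are hypothetical. The statements use standard Delta Lake SQL (`OPTIMIZE ... WHERE`, `VACUUM ... RETAIN n HOURS`, `DESCRIBE DETAIL`), and 168 hours matches Delta's default seven-day retention.

```python
from datetime import date, timedelta
from typing import List


def maintenance_sql(table: str, partition_col: str, today: date,
                    recent_days: int = 3, retain_hours: int = 168) -> List[str]:
    """Build the statements for one maintenance run (names and windows are illustrative)."""
    cutoff = (today - timedelta(days=recent_days)).isoformat()
    return [
        # Compact small files, but only in partitions that are still receiving writes
        f"OPTIMIZE {table} WHERE {partition_col} >= '{cutoff}'",
        # Physically delete files no longer referenced by the table, past retention
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
        # Returns numFiles and sizeInBytes, useful for tracking average file size
        f"DESCRIBE DETAIL {table}",
    ]


for statement in maintenance_sql("events", "event_date", date(2026, 4, 9)):
    print(statement)
```

In a Spark job, each string would be passed to `spark.sql(...)`. The sketch only builds the statements so it can run anywhere.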
❌ Stop doing SysAdmin. ✅ Start doing Data Science (Sponsor)
Complex AI setups shouldn't risk your sensitive data or your funding. Launch pre-configured deep learning workspaces in one click with RONIN. Build with enterprise-grade security for your models and strict financial guardrails that stop runaway cloud costs, launch and scale environments instantly, and stay laser focused on AI and data.
Simplify your life with RONIN
Apache Airflow 3.2.0: Data-Aware Workflows at Scale (6 minute read)
Apache Airflow 3.2.0 adds asset partitioning for data-aware scheduling, so downstream DAGs trigger only for the exact partition that changed rather than every upstream update. It also introduces experimental multi-team support for isolating DAGs, connections, variables, pools, and executors in a single deployment, plus synchronous deadline alert callbacks via the executor. Rendered task instance field cleanup is now about 42x faster for heavily mapped DAGs, and PythonOperator now supports async callables.
When Every Bit Counts: How Valkey Rebuilt Its Hashtable for Modern Hardware (37 minute video)
Valkey (a Redis fork) completely redesigned its core hash table to better match modern CPU cache behavior and reduce memory overhead. Key changes include moving from pointer-heavy linked-list collision handling to a cache-friendly open-addressing design inspired by Swiss tables, embedding keys and metadata directly to eliminate most pointers, using hybrid chaining when buckets fill up, and adding smart prefetching.
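The layout logic can be illustrated with a toy model. This is a sketch of the general idea, not Valkey's code. The slot count of 7, the one-byte fingerprint taken from the hash, and the class names are assumptions. Python lists give none of the cache-line benefits, and resizing is omitted.

```python
from typing import Any, List, Optional, Tuple

SLOTS = 7  # entries per bucket (assumed; chosen to mimic a cache-line-sized bucket)


class Bucket:
    def __init__(self) -> None:
        self.fingerprints: List[int] = []         # one small hash fragment per entry
        self.entries: List[Tuple[Any, Any]] = []  # (key, value) stored inline
        self.child: Optional["Bucket"] = None     # overflow bucket, created only when full


class Table:
    def __init__(self, n_buckets: int = 8) -> None:
        self.buckets = [Bucket() for _ in range(n_buckets)]

    def _locate(self, key: Any) -> Tuple[Bucket, int]:
        h = hash(key)
        return self.buckets[h % len(self.buckets)], (h >> 8) & 0xFF

    def get(self, key: Any) -> Any:
        bucket, fp = self._locate(key)
        while bucket is not None:
            for f, (k, v) in zip(bucket.fingerprints, bucket.entries):
                # Cheap fingerprint check first; compare full keys only on a match
                if f == fp and k == key:
                    return v
            bucket = bucket.child
        return None

    def put(self, key: Any, value: Any) -> None:
        bucket, fp = self._locate(key)
        free = None
        while True:
            for i, (f, (k, _)) in enumerate(zip(bucket.fingerprints, bucket.entries)):
                if f == fp and k == key:
                    bucket.entries[i] = (key, value)  # overwrite existing key
                    return
            if free is None and len(bucket.entries) < SLOTS:
                free = bucket
            if bucket.child is None:
                break
            bucket = bucket.child
        if free is None:
            # Every bucket in the chain is full: chain one more overflow bucket
            bucket.child = Bucket()
            free = bucket.child
        free.fingerprints.append(fp)
        free.entries.append((key, value))
```

The point of the fingerprint array is that most non-matching entries are rejected by comparing a single byte, without touching the key itself.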
Introducing Metrics SQL: A SQL-based semantic layer for humans and agents (8 minute read)
Rill's Metrics SQL creates a SQL-native semantic layer where business metrics are defined once and queried consistently across dashboards, tools, and AI agents, eliminating metric drift. It enables deterministic, secure, and high-performance analytics by compiling simple metric queries into optimized database SQL.
Dagster vs Airflow 3 (Reddit Thread)
Airflow is framed as the safer, more mature default for cron-style batch orchestration: it is widely adopted, scalable, and a good fit when dependency logic already lives in code. Dagster is praised for its modern UI, its dbt integration and asset model, and an easier day-to-day developer experience.
Simplest hash functions (11 minute read)
For many practical uses, the simplest hash function that provides good enough distribution is often the best choice. When security is not a concern, the naive addition hash performs decently on long text, and adding a single foldmul step brings collision rates in hash tables close to those of SHA-256 while using almost no code or CPU.
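As a rough illustration, an addition hash with one foldmul step might look like the sketch below. The 8-byte word size, the constant `K`, and the function names are assumptions rather than the article's exact choices. Foldmul here means a full 64x64-bit multiply whose high and low halves are XORed together.

```python
MASK64 = (1 << 64) - 1
K = 0x9E3779B97F4A7C15  # an odd 64-bit constant; an arbitrary choice for this sketch


def add_hash(data: bytes) -> int:
    """Naive addition hash: sum the input as little-endian 8-byte words, mod 2**64."""
    h = 0
    for i in range(0, len(data), 8):
        h = (h + int.from_bytes(data[i:i + 8], "little")) & MASK64
    return h


def foldmul(a: int, b: int) -> int:
    """Multiply to a 128-bit product, then XOR its high and low 64-bit halves."""
    p = a * b
    return (p & MASK64) ^ (p >> 64)


def simple_hash(data: bytes) -> int:
    """Addition hash followed by a single foldmul mixing step."""
    return foldmul(add_hash(data), K)


def bucket_index(data: bytes, n_bits: int) -> int:
    """Pick a hash-table slot from the top bits of the mixed hash."""
    return simple_hash(data) >> (64 - n_bits)
```

Because addition is commutative, reordering whole words leaves `add_hash` unchanged, which is one reason this family is only suitable where adversarial inputs are not a concern.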
Curated deep dives, tools and trends in big data, data science and data engineering 📊
Join 400,000 readers for one daily email