TLDR Data 2026-04-06
Netflix’s Multimodal Search 🎬, Tools Over Attention 🧠, Dashboard Graveyard 🪦
Starting in 48 hours! Learn how to bring AI to your data (Sponsor)
Is fragmented data holding back AI? Instead of moving your data into separate AI environments, you can bring AI to your data to simplify your architecture and move faster.
On April 8, senior engineering leaders from Oracle and Activision Blizzard will discuss new innovations in AI for data work. Join this exclusive webinar to learn how to:
- Use AI designed for data to drive business outcomes without added complexity.
- End data chaos with a unified foundation for transactional, analytical, and AI workloads.
- Run AI consistently across multicloud, hybrid, and on-premises environments.
- Reduce data risk and defend against new AI threats.
Attend live for the extended Q&A
Powering Multimodal Intelligence for Video Search (9 minute read)
Netflix built a multimodal video search architecture to surface key moments across hundreds to thousands of hours of footage, replacing brittle keyword search with AI-driven retrieval over characters, scenes, dialogue, and embeddings. The system uses overlapping temporal segmentation, Cassandra for high-throughput annotation storage, Kafka for asynchronous processing, and Elasticsearch for real-time querying, with one-second buckets and composite-key upserts to maintain a single source of truth.
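To make the one-second buckets and composite-key upserts concrete, here is a minimal sketch. It is not Netflix's implementation: the key layout `(asset_id, annotation_type, bucket_second)`, the function names, and the in-memory dict standing in for the annotation store are all assumptions for illustration.

```python
import math
from typing import Dict, Iterator, Tuple

# Hypothetical composite key: (asset_id, annotation_type, bucket_second).
Key = Tuple[str, str, int]


def second_buckets(start_s: float, end_s: float) -> Iterator[int]:
    """Yield every whole-second bucket touched by the interval [start_s, end_s)."""
    if end_s <= start_s:
        return
    yield from range(math.floor(start_s), math.ceil(end_s))


def upsert_annotation(store: Dict[Key, dict], asset_id: str, kind: str,
                      start_s: float, end_s: float, payload: dict) -> None:
    """Write the payload once per bucket; writing the same key again overwrites it."""
    for bucket in second_buckets(start_s, end_s):
        store[(asset_id, kind, bucket)] = payload


store: Dict[Key, dict] = {}
upsert_annotation(store, "title-1", "character", 12.4, 14.1,
                  {"label": "A", "model": "v1"})
# An overlapping segment reprocessed later lands on the same keys,
# so the store keeps one row per bucket instead of duplicates.
upsert_annotation(store, "title-1", "character", 13.0, 15.0,
                  {"label": "A", "model": "v2"})
```

Because overlapping segments resolve to the same keys, reprocessing replaces rows rather than appending them, which is one way a store can stay a single source of truth.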
How Datadog Redefined Data Replication (7 minute read)
Datadog reworked a Metrics Summary page that hit 7-second p90 latency because Postgres was being used for expensive search-style joins across 82,000 metrics and 817,000 configurations. The fix was to stop querying Postgres directly and instead stream changes via CDC: Debezium reads the Postgres write-ahead log (WAL), Kafka buffers updates, and a search platform serves low-latency queries. To make async replication safe at scale, Datadog added schema-migration validation plus a backward-compatible Schema Registry with Avro.
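The consuming side of a CDC pipeline like this can be sketched in a few lines. The `op`, `before`, and `after` fields follow Debezium's change-event envelope; everything else is an assumption for illustration: the row columns are made up, a plain dict stands in for the search platform, and Kafka consumption and Avro decoding are left out.

```python
from typing import Dict


def apply_change(index: Dict[int, dict], event: dict) -> None:
    """Apply one Debezium-style change event to a search-side copy of a table."""
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update
        row = event["after"]
        index[row["id"]] = row  # upsert by primary key
    elif op == "d":  # delete
        index.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


# Hypothetical events for a metrics table, in commit order.
events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "name": "cpu.user", "tags": 3}},
    {"op": "u", "before": {"id": 1, "name": "cpu.user", "tags": 3},
     "after": {"id": 1, "name": "cpu.user", "tags": 5}},
    {"op": "c", "before": None,
     "after": {"id": 2, "name": "mem.free", "tags": 1}},
    {"op": "d", "before": {"id": 2, "name": "mem.free", "tags": 1},
     "after": None},
]

index: Dict[int, dict] = {}
for event in events:
    apply_change(index, event)
```

Upserting by primary key keeps the replay idempotent, so redelivering an event does not corrupt the replica.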
Improving storage efficiency in Magic Pocket, our immutable blob store (8 minute read)
Dropbox improved storage efficiency in Magic Pocket, its custom exabyte-scale immutable blob store, after a new service rollout increased fragmentation and storage overhead, particularly from severely under-filled volumes. With a multi-level compaction strategy combined with dynamic rate-limiting and better controls, its team reduced compaction overhead by 30–50%.
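As a rough illustration of why under-filled volumes are attractive compaction targets, the sketch below groups them with plain first-fit-decreasing bin packing. This is a generic technique swapped in for illustration, not Dropbox's multi-level strategy, and it omits rate-limiting entirely. The function name, the `max_fill` threshold, and the volume sizes are hypothetical.

```python
from typing import Dict, List


def plan_compaction(live_bytes: Dict[str, int], capacity: int,
                    max_fill: float = 0.5) -> List[List[str]]:
    """Group under-filled volumes so each group's live data fits in one new volume."""
    # Only volumes below the fill threshold are candidates, largest first.
    candidates = sorted(
        (vol for vol, used in live_bytes.items() if used / capacity < max_fill),
        key=lambda vol: live_bytes[vol],
        reverse=True,
    )
    groups: List[List[str]] = []
    free: List[int] = []  # remaining room in each planned target volume
    for vol in candidates:
        size = live_bytes[vol]
        for i, room in enumerate(free):
            if size <= room:  # first target with enough room wins
                groups[i].append(vol)
                free[i] -= size
                break
        else:
            groups.append([vol])
            free.append(capacity - size)
    # In this simplified model, a group of one merges nothing, so skip it.
    return [group for group in groups if len(group) > 1]


# Hypothetical live-data sizes per volume, with a capacity of 100 units.
volumes = {"v1": 10, "v2": 40, "v3": 30, "v4": 90, "v5": 20}
plan = plan_compaction(volumes, capacity=100)
```

Here four sparsely filled volumes collapse into one full volume, while the well-filled `v4` is left alone.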
Dashboard rot as org attention grave markers (8 minute read)
Most dashboards die unused because organizational attention constantly shifts, making them short-lived artifacts of past priorities rather than durable decision tools. Dashboard sprawl is more than overproduction - it reflects limited human attention, where teams move on before sustained value or maintenance ever materializes.
Consulting the Oracle: Claude on the Future of Data (13 minute read)
Claude predicts that BI tools are becoming largely obsolete as LLMs handle natural language queries and charting better, that ETL vendors face major disruption within 18 months, that data warehouses are evolving into cheap, elastic compute utilities built on open table formats, and that the Modern Data Stack is collapsing into three layers: storage, compute, and context, with data engineers shifting from pipeline builders to “context curators”.
How to Become a Data AND AI Engineer (5 minute read)
AI is increasingly embedded in data engineering, but AI engineering still depends on strong data engineering foundations. Data modeling drives about 80% of the impact, and clear dbt descriptions, data contracts, lineage, and orchestration are effectively “context engineering” for AI. Engineers must review outputs critically because AI can surface answers without the instinct to know when not to act.
30 BI Engineering Interview Questions That Actually Matter in the AI Era (27 minute read)
BI engineering interviews are shifting from classic SQL and dimensional modeling toward governance, semantic layers, and AI-safe analytics. Core skills still matter, but the real differentiators are defining canonical metrics, enforcing data contracts and SLAs, and ensuring AI agents query governed semantics instead of raw warehouse tables. Trustworthy analytics now depends on machine-readable governance, auditability, and business context - not just dashboard building.
Next Major MCP Update Focuses on Scaling Agentic AI (3 minute read)
The next MCP specification release, due in June, will add stateless servers to help IT teams deploy AI applications at higher scale, with cloud providers able to spin up servers on demand. The roadmap also includes task support for long-running autonomous workflows, server-initiated triggers, and later additions like retry semantics, expiration policies, native streaming, and reusable domain skills. MCP SDKs already see 110 million downloads per month, underscoring rapid enterprise adoption as teams connect AI agents to systems of record behind firewalls.
NornicDB (GitHub Repo)
NornicDB is a low-latency Neo4j-compatible graph + vector database built for AI-native workloads. It combines Cypher queries with GPU-accelerated embedding search and auto-discovery of relationships. NornicDB collapses graph, vector, and memory semantics into one system, enabling evolving knowledge graphs and persistent agent memory without stitching multiple tools together.
Data Inlining in DuckLake: Unlocking Streaming for Data Lakes (13 minute read)
Data Inlining is a DuckLake technique that stores small changes (inserts, deletes, or updates below a configurable threshold) directly in the catalog database instead of writing them as tiny Parquet files to object storage. This elegantly solves the classic "small files problem" in data lakes, enabling efficient, low-latency streaming workloads (like sensor data) without constant compaction jobs.
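The routing decision behind inlining can be sketched as follows. This is a toy model rather than DuckLake's API: the class, the `row_limit` parameter, and the in-memory lists standing in for the catalog database and Parquet files are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InliningWriter:
    """Toy table writer that inlines small batches instead of creating files."""
    row_limit: int = 10  # hypothetical threshold; batches below it are inlined
    catalog_rows: List[dict] = field(default_factory=list)      # stands in for the catalog DB
    data_files: List[List[dict]] = field(default_factory=list)  # stands in for Parquet files

    def write(self, rows: List[dict]) -> None:
        if len(rows) < self.row_limit:
            self.catalog_rows.extend(rows)      # small batch: no new file
        else:
            self.data_files.append(list(rows))  # large batch: one new file

    def scan(self) -> List[dict]:
        """A read has to combine file rows with the inlined rows."""
        rows = [row for data_file in self.data_files for row in data_file]
        return rows + self.catalog_rows


writer = InliningWriter(row_limit=3)
writer.write([{"sensor": 1}])                        # inlined
writer.write([{"sensor": 2}, {"sensor": 3}])         # inlined
writer.write([{"sensor": i} for i in range(4, 8)])   # written as one file
```

Two small streaming batches produce zero new files here, which is the effect that avoids the small files problem.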
Coding Agents are Effective Long-Context Processors (17 minute read)
Coding agents significantly outperform traditional LLM and RAG approaches on long-context tasks by externalizing reasoning into executable actions, using tools like file systems, search commands, and code to iteratively explore and process massive text corpora more effectively than latent attention alone.
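The idea of exploring a corpus through tools rather than attention can be illustrated with a grep-style search an agent might call. This is a simplified sketch, not the setup evaluated in the article: the function name, the `max_hits` cap, and the tiny in-memory corpus are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def grep(corpus: Dict[str, str], pattern: str,
         max_hits: int = 20) -> List[Tuple[str, int, str]]:
    """Return (file name, line number, line) for lines matching the pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits: List[Tuple[str, int, str]] = []
    for name, text in corpus.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append((name, lineno, line.strip()))
                if len(hits) >= max_hits:  # cap what flows back into the context
                    return hits
    return hits


# Hypothetical corpus; in practice this would be files on disk.
corpus = {
    "notes/a.txt": "intro\nThe retry budget is 3 attempts.\noutro",
    "notes/b.txt": "nothing relevant here\nstill nothing",
    "notes/c.txt": "Retry policy changed in v2.\nbudget unchanged",
}
hits = grep(corpus, r"retry")
```

Only the matching lines reach the model, so the context stays small however large the corpus grows, and the agent can follow up with narrower searches.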
The hidden technical debt of agentic engineering (17 minute read)
Agentic engineering creates the same kind of hidden technical debt ML teams faced ten years ago, but at higher speed: agents are easy to build locally, then quickly become hard to run safely at production scale. The core burden sits outside the agent itself: centralized integrations, live runtime context, decision traces, an agent registry, observability, evals, feedback loops, human-in-the-loop controls, governance, and orchestration. Platform teams need visibility and standardized controls early, or they'll retrofit them after incidents, cost overruns, or data exposure.
Curated deep dives, tools and trends in big data, data science and data engineering 📊