TLDR Data 2026-05-21
AI’s Uneasy Promise ⚖️, Public Data Simplified 🔎, Evals Before Experiments 🧪
mondayDB 3 – Solving HTAP for a Trillion-Table System (21 minute read)
mondayDB 3 is an HTAP system designed to handle over a trillion dynamic, constantly evolving tables with highly flexible schemas. The team replaced its MySQL + JSON architecture with a CQRS-based Lambda architecture powered by DuckDB: immutable snapshots in object storage, an external WAL for real-time changes, and a soft-stateful serving layer that syncs and queries local DuckDB files on every read.
The Evolution of Cassandra Data Movement at Netflix (8 minute read)
Netflix replaced its Cassandra-to-Iceberg movement engine with a layered platform that reads backups directly from S3, converts them to Spark DataFrames, and lets each data abstraction build its own optimized connector. The engine moves about 3 PB/day. The migration used shadow validation, enhanced observability, and a Maestro Decider fallback to the prior solution, enabling a transparent cutover with zero downstream code changes.
How We Cut BigQuery Slot Usage by 90% On One Of Our Most Resource Hungry Service After a Production Outage (13 minute read)
Teads dramatically cut BigQuery slot usage by 90%+ on their Audience Planning service through application fixes (request coalescing with Redis distributed locks to eliminate duplicate queries, fail-fast validation for huge filters, and rewriting large IN clauses as semi-joins) combined with data model optimizations (compressing data types, precomputing repeated work, and an improved partitioning strategy), reducing the effective table footprint by ~95%.
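To make the request coalescing idea concrete, here is a minimal sketch. It is not Teads' implementation: the article describes Redis distributed locks across service instances, while this version swaps in an in-process `threading.Lock` so it runs without any external service. The `Coalescer` class, the `demo` helper, and the `"audience:42"` key are hypothetical names used only for illustration.

```python
import threading
import time


class Coalescer:
    """In-process stand-in for request coalescing (assumption: the real
    system coordinates across instances with Redis distributed locks)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> shared entry for the query in progress

    def run(self, key, fn):
        # The first caller for a key becomes the leader and runs fn();
        # concurrent callers with the same key wait for the leader's result.
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = {"done": threading.Event(), "value": None, "error": None}
                self._inflight[key] = entry
        if leader:
            try:
                entry["value"] = fn()
            except Exception as exc:  # propagate the failure to followers too
                entry["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                entry["done"].set()
        else:
            entry["done"].wait()
        if entry["error"] is not None:
            raise entry["error"]
        return entry["value"]


def demo(n_threads=5):
    """Fire n identical requests at once and count how many hit the backend."""
    calls = []
    results = []
    coalescer = Coalescer()

    def slow_query():
        calls.append(1)      # stands in for an expensive warehouse query
        time.sleep(0.3)
        return "rows"

    def worker():
        results.append(coalescer.run("audience:42", slow_query))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), results
```

Calling `demo()` starts five identical requests while the first is still running, so only the leader executes the slow query and the others reuse its result. A distributed version would also need lock expiry and a shared result cache, which this sketch leaves out.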
Returning to life! (6 minute read)
AI is genuinely empowering for data science: it makes programming, translation, voice input, and broad learning more accessible. It is also genuinely harmful through environmental cost, copyright issues, wealth concentration, shallow thinking, and unequal access. The tension cannot be neatly resolved, but data science leaders still need to engage with AI seriously so they can help people use it well.
Better Experiments with LLM Evals — A funnel, not a fork (4 minute read)
Treat LLM evals and online A/B experiments as a funnel by using LLM judges early to verify quality (relevance, tone, and coherence) and filter out weak ideas before they consume experiment resources. This raises the success rate of experiments. Running evals on experiment results creates a feedback loop that continuously calibrates and improves the judges themselves.
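The funnel can be sketched in a few lines. This is an illustration under assumptions, not the article's code: the `Candidate` type, the 0.7 threshold, and the rule that every criterion must clear the bar are all hypothetical choices, and the judge scores are supplied by hand rather than produced by a real LLM judge.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    judge_scores: dict  # criterion -> score in [0, 1], from an LLM judge


def passes_eval(candidate, threshold=0.7):
    # Assumption: a candidate must clear the bar on every criterion.
    return min(candidate.judge_scores.values()) >= threshold


def funnel(candidates, threshold=0.7):
    """Keep only the candidates worth spending A/B experiment traffic on."""
    return [c for c in candidates if passes_eval(c, threshold)]


def judge_agreement(judge_passed, experiment_won):
    """Fraction of cases where the judge's verdict matched the experiment
    outcome; a rough calibration signal for the feedback loop."""
    pairs = list(zip(judge_passed, experiment_won))
    return sum(j == e for j, e in pairs) / len(pairs)


candidates = [
    Candidate("variant_a", {"relevance": 0.9, "tone": 0.8, "coherence": 0.85}),
    Candidate("variant_b", {"relevance": 0.9, "tone": 0.5, "coherence": 0.9}),
]
shortlist = funnel(candidates)  # only variant_a clears every criterion
```

Tracking `judge_agreement` over time is one simple way to notice when a judge has drifted away from what the experiments actually reward.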
What's Easy Now? What's Hard Now? (4 minute read)
The long-term capabilities of coding agents will be determined more by the quality of feedback loops than by raw model intelligence. Tasks with fast, accurate, automated feedback (e.g. building high-performance databases with formal specs) will become surprisingly “easy” for agents, while tasks reliant on slow, subjective human feedback will remain relatively “hard.”
The pipeline tax is breaking enterprise AI at agent scale (5 minute read)
Enterprise AI is hitting a “pipeline tax”: moving data through warehouses, lakehouses, vector DBs, RAG layers, and orchestration stacks adds latency, governance drift, and audit pain, with data copied up to 4 times and regulated answers taking weeks to reconstruct. The emerging solution is to bring agents to the data and make governance native to the data layer, with SQL databases, MCP, and Iceberg as core pieces. The same shift is reframing migration as a continuous AI-driven capability rather than a one-off project.
OpenData (Tool)
OpenData is an open-core platform that makes public datasets easy to search, join, query, visualize, and share through one clean API.
WrenAI (GitHub Repo)
WrenAI is an open-source context layer that helps AI agents understand business data, retrieve the right semantics, and generate governed, reliable SQL across existing data stacks.
Monitoring Cortex Agent Performance With Trace Data (8 minute read)
By querying structured observability events and tracking key metrics at the span level (planning, tool calls, and response generation), such as token consumption, duration/latency, and error rates, teams can effectively monitor agents in production to detect issues like context bloat in multi-turn conversations, token spikes, slow tool calls, or shifts in question complexity.
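As a rough illustration of span-level monitoring, the sketch below aggregates trace events by span type. The event fields (`span_type`, `tokens`, `duration_ms`, `error`) are hypothetical and do not reflect the actual observability schema, which the summary does not specify.

```python
from collections import defaultdict


def summarize_spans(events):
    """Aggregate token use, average latency, and error rate per span type."""
    totals = defaultdict(
        lambda: {"spans": 0, "tokens": 0, "duration_ms": 0.0, "errors": 0}
    )
    for event in events:
        t = totals[event["span_type"]]
        t["spans"] += 1
        t["tokens"] += event["tokens"]
        t["duration_ms"] += event["duration_ms"]
        t["errors"] += 1 if event["error"] else 0
    return {
        span_type: {
            "spans": t["spans"],
            "tokens": t["tokens"],
            "avg_duration_ms": t["duration_ms"] / t["spans"],
            "error_rate": t["errors"] / t["spans"],
        }
        for span_type, t in totals.items()
    }


# Hypothetical events for one agent request.
events = [
    {"span_type": "planning", "tokens": 100, "duration_ms": 200.0, "error": False},
    {"span_type": "tool_call", "tokens": 50, "duration_ms": 1000.0, "error": True},
    {"span_type": "tool_call", "tokens": 150, "duration_ms": 3000.0, "error": False},
]
summary = summarize_spans(events)
```

Comparing these per-span figures across time windows is one way to spot token spikes or slow tool calls before they show up as user complaints.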
What data agent benchmarks do and don't tell us (6 minute read)
AI Council showed the data/AI divide is collapsing: most vendors now position themselves as AI infrastructure layers for context retrieval, orchestration, or inference, and new systems like LanceDB are being built natively for LLM and multimodal workloads. Benchmarking is shifting too: dbt Semantic Layer tests, ADE-bench, and 90-day simulations all suggest agents perform best on well-specified, context-rich tasks and improve when stateful, cross-system context from GitHub, Slack, Notion, and dbt is available. The next major constraint is token and compute efficiency.
Protocols for transactional usage of object storage (8 minute read)
Object storage can support serializable OLTP if you build on three write primitives (atomic PUTs, conditional PUT If-Match/If-None-Match, and strongly consistent LISTs) and three read primitives (atomic GETs, conditional GET If-None-Match, and consistent listing). The core tradeoff is safety versus contention cost, and you need safe garbage collection to prevent the storage from growing indefinitely.
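The conditional PUT primitive is what makes optimistic concurrency possible, and a small in-memory model shows the shape of it. This is a sketch under assumptions, not the article's protocol: `InMemoryStore`, its version-counter ETags, and `append_record` are hypothetical stand-ins for a real object store's `If-Match` and `If-None-Match` behavior.

```python
class PreconditionFailed(Exception):
    """Raised when a conditional PUT's precondition does not hold."""


class InMemoryStore:
    """Toy object store modelling conditional PUT semantics (assumption:
    ETags here are version counters, not content hashes)."""

    def __init__(self):
        self._objects = {}  # key -> (etag, data)
        self._version = 0

    def get(self, key):
        return self._objects.get(key)  # (etag, data), or None if absent

    def put(self, key, data, if_match=None, if_none_match_any=False):
        current = self._objects.get(key)
        # Models "If-None-Match: *": create only if the key does not exist.
        if if_none_match_any and current is not None:
            raise PreconditionFailed(key)
        # Models "If-Match: <etag>": overwrite only the version we read.
        if if_match is not None and (current is None or current[0] != if_match):
            raise PreconditionFailed(key)
        self._version += 1
        etag = f"v{self._version}"
        self._objects[key] = (etag, data)
        return etag


def append_record(store, key, record, max_retries=5):
    """Read-modify-write with compare-and-swap; retry if another writer won."""
    for _ in range(max_retries):
        current = store.get(key)
        try:
            if current is None:
                return store.put(key, record + b"\n", if_none_match_any=True)
            etag, data = current
            return store.put(key, data + record + b"\n", if_match=etag)
        except PreconditionFailed:
            continue  # lost the race: re-read and try again
    raise RuntimeError("too much contention")


store = InMemoryStore()
first_etag = append_record(store, "log/manifest", b"txn-1")
append_record(store, "log/manifest", b"txn-2")
```

A writer that still holds `first_etag` would now be rejected, which is the safety half of the tradeoff. The retry loop is the contention cost: the more writers target the same key, the more attempts are wasted.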
Curated deep dives, tools and trends in big data, data science and data engineering 📊