TLDR Data 2026-05-11
Docs vs Skills Reality 🤖, Scenario Models Need Guardrails 🏛️, Rust-Powered AI Storage 🦀
When the Uncertainty Is Bigger Than the Shock: Scenario Modelling for English Local Elections (13 minute read)
Scenario models are data products, not predictions. Model uncertainty can be larger than the scenario shock, so point forecasts and rankings are misleading without intervals. Version assumptions, freeze model artifacts, store residuals, expose uncertainty bands, log guardrails, and plan post-event audits.
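To make the "uncertainty bigger than the shock" point concrete, here is a minimal sketch that builds an empirical band from stored residuals and checks whether a scenario shock stands out from it. The function names, the 90% coverage default, and the sample residuals are illustrative assumptions, not taken from the article.

```python
def uncertainty_band(point_forecast, residuals, coverage=0.9):
    """Empirical band from stored residuals (actual minus predicted).

    Hypothetical helper: picks the residuals nearest the lower and upper
    quantiles implied by `coverage` and adds them to the point forecast.
    """
    ordered = sorted(residuals)
    n = len(ordered)
    lo_idx = round((1 - coverage) / 2 * (n - 1))
    hi_idx = round((1 + coverage) / 2 * (n - 1))
    return point_forecast + ordered[lo_idx], point_forecast + ordered[hi_idx]


def shock_exceeds_noise(shock, residuals, coverage=0.9):
    """True only if the scenario shock is larger than the band's half-width."""
    lo, hi = uncertainty_band(0.0, residuals, coverage)
    return abs(shock) > (hi - lo) / 2


# Illustrative residuals in percentage points (made-up numbers).
residuals = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
band = uncertainty_band(30.0, residuals)
small_shock_visible = shock_exceeds_noise(2.0, residuals)
large_shock_visible = shock_exceeds_noise(5.0, residuals)
```

With these sample residuals, a 2-point shock sits inside the band, so ranking scenarios by their point forecasts alone would overstate what the model can distinguish.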
How Discord Automates ScyllaDB Clusters at Scale (6 minute read)
Discord built Scylla Control Plane (SCP), a robust automation framework in Rust, to manage complex operational tasks across its large fleet of ScyllaDB clusters. SCP uses declarative YAML workflows composed of idempotent tasks and explicit safety conditions, with configurable parallelism and zoning constraints, enabling safe operations like rolling restarts, cluster expansions, and standing up shadow clusters in under two hours.
Enhancing Flink Deployment with Shadow Testing (3 minute read)
Grab's data streaming team added a Shadow Testing stage to its Apache Flink deployment pipeline to eliminate the ~10-minute rollback downtime caused by production-only failures. A shadow job runs in parallel with the main job in production, using distinct consumer groups and sink destinations to avoid interference while comparing metrics and outputs. The approach increases deployment frequency and reduces change failure rate.
The Roadmap to Mastering Tool Calling in AI Agents (7 minute read)
Most agent failures happen in the tool layer rather than in reasoning. Reliable production agents therefore need precise tool definitions that act as contracts, robust error handling with structured errors and circuit breakers, strategic parallelization, a managed tool catalog size, and targeted evaluation beyond simple end-to-end success.
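As a rough illustration of structured errors combined with a circuit breaker, the sketch below wraps a tool call so that failures come back as data the agent can reason about instead of raw exceptions. `ToolError`, `CircuitBreaker`, the error codes, and the thresholds are all hypothetical names chosen for this example; the article may structure these differently.

```python
import time
from dataclasses import dataclass


@dataclass
class ToolError:
    """Structured error returned to the agent (hypothetical shape)."""
    code: str
    message: str
    retryable: bool


class CircuitBreaker:
    """Stops calling a tool after repeated failures until a cooldown passes."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, tool, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                return ToolError("circuit_open", "tool temporarily disabled", True)
            # Cooldown elapsed: close the circuit and reset the failure count.
            self.opened_at = None
            self.failures = 0
        try:
            result = tool(**kwargs)
        except Exception as exc:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return ToolError("tool_failed", str(exc), False)
        self.failures = 0  # any success clears the failure streak
        return result
```

Injecting `clock` keeps the breaker testable without sleeping. A stricter variant could allow only a single trial call after the cooldown rather than resetting the count fully.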
From Data Catalogs to GraphRAG-Ready Data Product Portfolios (7 minute read)
GraphRAG is pushing enterprise AI beyond vector search by using explicit relationships between data products, entities, objectives, KPIs, and use cases. Traditional catalogs are insufficient because they stop at discovery, while AI assistants need machine-readable business context to answer portfolio questions like ownership, fit-for-purpose, and coverage gaps. Catalogs will organize the portfolio, graphs will connect it, and AI assistants will use the graph to reason across it.
We Ran 250 AI Agent Evals to Find Out if Skills Beat Docs. The Answer Is More Complicated Than We Expected (6 minute read)
Wix ran 250 evaluations to test whether AI skills outperform documentation when agents perform developer tasks. Agent-optimized docs proved to be a strong foundation, while skills delivered clear wins on token usage and speed when perfectly maintained and aligned. However, small errors, staleness, or over-prescription could dramatically increase cost and reduce agent flexibility.
How BigQuery actually executes a query (and why most optimization advice misses half the picture) (10 minute read)
BigQuery performance comes down to understanding its execution model: queries run as parallel stages across slots, and the main hidden cost is shuffle, not just bytes scanned. The Execution Details panel reveals stage-level slot-ms, max vs average compute time, and join strategy, making it possible to spot skew, fan-out, and expensive hash joins.
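One way to act on the max vs average compute time comparison is to flag stages where the slowest worker took far longer than the typical one. The sketch below works on plain dictionaries; the field names, the 3x threshold, and the sample numbers are assumptions for illustration and would need to be mapped from whatever stage statistics you export.

```python
def skewed_stages(stages, ratio_threshold=3.0):
    """Return names of stages whose max compute time dwarfs the average.

    A high max/avg ratio suggests a few workers handled most of the data,
    which is the usual signature of key skew.
    """
    flagged = []
    for stage in stages:
        avg = stage["compute_ms_avg"]
        if avg > 0 and stage["compute_ms_max"] / avg >= ratio_threshold:
            flagged.append(stage["name"])
    return flagged


# Made-up stage statistics for illustration only.
stages = [
    {"name": "S00: Input", "compute_ms_avg": 120, "compute_ms_max": 150},
    {"name": "S01: Join+", "compute_ms_avg": 200, "compute_ms_max": 4200},
    {"name": "S02: Output", "compute_ms_avg": 0, "compute_ms_max": 0},
]
flagged = skewed_stages(stages)
```

Here only the join stage is flagged, since its slowest worker ran about 21 times longer than the average.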
Flowfile (GitHub Repo)
Flowfile is a visual ETL tool built around Polars. Design pipelines on a drag-and-drop canvas, or define them in Python with a Polars-like API. Visual workflows can export to standalone Python/Polars code, avoiding classic low-code lock-in. It also includes a Delta-backed catalog, SQL editor, scheduler, parameters, and sandboxed Python kernels.
Data Landscape (Tool)
Data Landscape is an interactive map of open standards behind modern data architecture: contracts, schemas, semantics, file/table formats, movement, processing, catalogs, lineage, query, quality, observability, policies, and AI interfaces.
HelixDB (GitHub Repo)
HelixDB is an open-source Rust database for AI apps that combines graph, vector, document, KV, and relational-style storage, with built-in MCP, embeddings, RAG tooling, and type-safe queries.