TLDR Data 2026-05-18
Query Planning Slowdown 🐢, Airbnb’s Data Mesh 🧩, Ontology-Driven Policies 🧬
Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse (9 minute read)
Cloudflare's shift to per-tenant retention in a massive ClickHouse “Ready-Analytics” table exposed an unexpected scaling limit: query planning, not I/O or scan volume, became the bottleneck as parts per replica grew. Tracing showed 45% of leaf query CPU time in part filtering. Switching to a shared lock and then a shared-read cache removed most of the contention and cut query latency sharply.
AWS Outage May 2026: Lessons for Database Disaster Recovery (10 minute read)
A major AWS US-EAST-1 outage in May was triggered by a data center overheating event in a single availability zone, causing multi-hour disruptions for high-profile services like Coinbase. The incident highlighted the critical difference between Multi-AZ high availability (which failed to protect latency-sensitive workloads) and true cross-region disaster recovery.
Viaduct 1.0 and the Future of Airbnb's Data Mesh (5 minute read)
Viaduct 1.0 is Airbnb's open-source, data-oriented service mesh built on GraphQL. It provides a single unified schema for accessing any data source across the company. Teams contribute their own schema and resolvers as multi-tenant modules without operating separate GraphQL services, which keeps development decentralized and strikes a balance between a monolithic GraphQL server and full federation.
The Modern Data Stack is Overcomplicated: Data Ingestion (17 minute read)
Data ingestion looks simple, but the wrong choice can create hidden costs through broken connectors, schema drift, over-engineering, and wasted engineering time. The best approach is usually a hybrid: managed connectors for standard SaaS, streaming only when low latency truly matters, and custom pipelines for niche or legacy sources.
Welcome to ORDER BY Jungle (11 minute read)
PostgreSQL resolves column names and expressions in ORDER BY clauses in inconsistent ways. For example, bare identifiers (e.g. ORDER BY a) first look for aliases in the SELECT list, while any expression (e.g. ORDER BY -a) resolves against the FROM clause, leading to confusing behaviors with aliases, quoting, GROUP BY, window functions, and UNION.
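The resolution rule summarized above can be modeled in a few lines of Python. This is a toy sketch, not PostgreSQL's implementation: the table `t`, the query `SELECT b AS a FROM t`, and the helpers `order_key` and `run` are all hypothetical, and the model ignores GROUP BY, window functions, and UNION.

```python
# Toy model of the name-resolution rule; not PostgreSQL's actual code.
rows = [{"a": 1, "b": 30}, {"a": 2, "b": 20}, {"a": 3, "b": 10}]  # table t

# Models the SELECT list of: SELECT b AS a FROM t
select_list = {"a": lambda row: row["b"]}

def order_key(order_by, row):
    """Return the sort key for one input row under the simplified rule."""
    if order_by.isidentifier() and order_by in select_list:
        # Bare identifier: the SELECT-list alias wins.
        return select_list[order_by](row)
    # Anything else is an expression: names bind to the input columns.
    # eval is acceptable here only because the inputs are hard-coded.
    return eval(order_by, {"__builtins__": {}}, dict(row))

def run(order_by):
    """Sort the input rows, then project the single output column."""
    ordered = sorted(rows, key=lambda row: order_key(order_by, row))
    return [select_list["a"](row) for row in ordered]

by_alias = run("a")      # ORDER BY a     -> sorts by the alias, i.e. column b
by_expr = run("a + 0")   # ORDER BY a + 0 -> sorts by the input column t.a
```

In this model the two calls return the output column in opposite orders, even though both ORDER BY clauses appear to mention the same name.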
Exploring schema evolution with ontology-driven propagation (4 minute read)
A plain-English ontology can act as a runtime access policy that survives schema evolution, letting an LLM classify columns one at a time using row counts, cardinality ratios, and sampled values. The approach keeps policy separate from pipeline code, but it does not cover numeric sensitive inferences or cross-column re-identification.
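The per-column evidence mentioned above (row counts, cardinality ratios, sampled values) could be gathered roughly as follows before it is handed to a classifier. This is a minimal sketch: `profile_column`, its `sample_size` parameter, and the sample table are hypothetical, and the LLM call itself is left out.

```python
def profile_column(name, values, sample_size=3):
    """Summarize one column: row count, cardinality ratio, sampled values."""
    non_null = [v for v in values if v is not None]
    distinct = list(dict.fromkeys(non_null))  # de-duplicate, keep first-seen order
    row_count = len(values)
    return {
        "column": name,
        "row_count": row_count,
        # Share of rows that carry a distinct value (1.0 means all unique).
        "cardinality_ratio": len(distinct) / row_count if row_count else 0.0,
        "samples": distinct[:sample_size],
    }

# Illustrative data only.
table = {
    "email": ["a@example.com", "b@example.com", "c@example.com", "d@example.com"],
    "country": ["DE", "DE", "FR", "DE"],
}
profiles = [profile_column(name, values) for name, values in table.items()]
```

A ratio near 1.0, as for `email` here, hints at an identifier-like column, while a low ratio, as for `country`, hints at a categorical one. The profile is computed per column, so it carries no cross-column signal.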
ducklake-sdk (GitHub Repo)
ducklake-sdk is an alpha Rust/Python SDK for reading and writing DuckLake tables without running DuckDB. It implements the DuckLake spec in a Rust core, with Python integrations for Polars, Arrow, and DuckDB, targeting SQL-catalog metadata plus Parquet storage. It is useful for embedding DuckLake access directly into apps, pipelines, or engines.
Apache Arrow as Data Interchange (5 minute read)
Apache Arrow is rapidly becoming the universal in-memory columnar format for data interchange across the modern data stack. Instead of repeatedly serializing, deserializing, and copying data between tools (Pandas → Spark → databases, etc.), Arrow enables zero-copy handoff, where systems share the exact same memory layout, dramatically reducing CPU overhead.
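The difference between a zero-copy handoff and a copy can be illustrated with Python's standard library alone. This sketch does not use the Arrow API; it relies on `array` and `memoryview` (the buffer protocol) as a stand-in for two tools agreeing on one memory layout, and the variable names are hypothetical.

```python
from array import array

column = array("q", [10, 20, 30, 40])   # contiguous 64-bit integers

view = memoryview(column)      # zero-copy: refers to the same buffer
copied = array("q", column)    # copy: allocates a second buffer

column[0] = 99                 # the "producer" mutates its data

shared_sees_change = view[0] == 99     # the view reads the producer's memory
copy_sees_change = copied[0] == 99     # the copy still holds the old value
```

The view costs no extra memory for the data and reflects the producer's buffer directly, whereas the copy pays for a second allocation and a full pass over the values.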
What Matters in Production RAG (8 minute read)
Key requirements for production RAG include smart chunking strategies (recursive, semantic, and structure-aware), robust indexing pipelines with document registries, content hashing for efficient updates, alias-based zero-downtime index switching, careful embedding model management, and strong observability with detailed tracing, chunk attribution, and retrieval quality metrics.
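Content hashing against a document registry, one of the requirements listed above, can be sketched as follows. This is a minimal illustration, not the article's implementation: `content_hash`, `plan_updates`, and the in-memory dict standing in for the registry are hypothetical.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_updates(registry, documents):
    """Compare incoming documents against the registry of known hashes.

    registry:  dict of doc_id -> content hash from the previous run
    documents: dict of doc_id -> current text
    Returns (to_index, to_delete, new_registry).
    """
    new_registry = {doc_id: content_hash(text) for doc_id, text in documents.items()}
    # Re-embed only documents that are new or whose content changed.
    to_index = sorted(d for d, h in new_registry.items() if registry.get(d) != h)
    # Drop index entries for documents that disappeared from the source.
    to_delete = sorted(d for d in registry if d not in new_registry)
    return to_index, to_delete, new_registry

registry = {}
docs_v1 = {"faq": "How do I reset my password?", "tos": "Terms of service v1"}
first_run, _, registry = plan_updates(registry, docs_v1)

docs_v2 = {"faq": "How do I reset my password?", "tos": "Terms of service v2", "blog": "Hello"}
second_run, removed, registry = plan_updates(registry, docs_v2)
```

On the second run only the changed and new documents are scheduled for re-indexing, and the unchanged `faq` entry is skipped.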
MinIO's MemKV promises 95% better GPU utilization by ending AI recompute tax (5 minute read)
MemKV is a petabyte-scale context memory store for AI inference designed to preserve and share session state across GPU clusters. By moving context directly from NVMe into the AI data path over 800 GbE RDMA, it targets the “recompute tax” and claims 95%+ better GPU utilization and about 50% lower cost per token on benchmark workloads.
Context pruning: cut LLM tokens without losing quality (9 minute read)
Context pruning is the practice of selectively removing low-value tokens, sentences, or passages from an LLM's input to reduce cost and latency and, often, to improve output quality. It includes techniques such as token-level, sentence/chunk-level, attention-based, and dynamic layer-progressive pruning, and works best when paired with semantic caching.
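Chunk-level pruning can be sketched with a deliberately simple relevance score. The example below uses word overlap with the query as a stand-in for the scoring methods the article covers; `tokens`, `prune_context`, the `keep` parameter, and the sample passages are hypothetical.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def prune_context(query, passages, keep=2):
    """Keep the `keep` passages sharing the most words with the query."""
    query_tokens = tokens(query)
    scored = [(len(query_tokens & tokens(p)), i) for i, p in enumerate(passages)]
    # Highest overlap first; ties go to the earlier passage.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:keep]
    kept_indexes = sorted(i for _, i in top)   # restore original order
    return [passages[i] for i in kept_indexes]

passages = [
    "ClickHouse stores data in parts that are merged in the background.",
    "Our office is closed on public holidays.",
    "Query planning time grows with the number of parts per replica.",
    "The cafeteria menu changes weekly.",
]
pruned = prune_context("Why is ClickHouse query planning slow with many parts?", passages)
```

Here the two off-topic passages are dropped and the remaining two keep their original order. A production system would likely replace the overlap score with embeddings or a learned relevance model.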
Your AI agent deletes critical data: Who is responsible? (5 minute read)
AI agents that can write to production systems create a new accountability and recovery problem: a Replit agent once deleted a live database, and the real issue was the absence of clear ownership, guardrails, and rollback. With 86% of IT/security leaders expecting agents to outrun current controls, governance is a shared responsibility across architecture, security, legal, and business. Practical controls like policy boundaries, observability, human-in-the-loop triage, and explicit recovery mechanisms are essential to prevent autonomous tools from becoming enterprise-wide risk.
Curated deep dives, tools and trends in big data, data science and data engineering 📊