TLDR Data 2026-03-26
Scaling Trino Simply, Snowflake Join Trap, Death of BI Layers
Why Your Snowflake Joins Are Slow: Fix OR Joins Fast (12 minute read)
Disjunctive joins (using OR in join conditions) break Snowflake's hash join optimization, forcing expensive Cartesian products and massive performance slowdowns. The fix is to rewrite them as separate equi-joins, restoring efficient execution and often delivering 100-200x speedups.
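A minimal sketch of the rewrite, run on SQLite through Python's standard library purely to show that the two query shapes return the same rows. It does not reproduce Snowflake's query plans or the reported speedups, and the orders and customers tables and their columns are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; tables and data are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, email TEXT, phone TEXT);
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT, phone TEXT);
INSERT INTO orders VALUES (1, 'a@x.com', '111'), (2, 'zzz', '222'), (3, 'c@x.com', '999');
INSERT INTO customers VALUES (10, 'a@x.com', '111'), (20, 'b@x.com', '222'), (30, 'd@x.com', '333');
""")

# Disjunctive join: a single join condition containing OR.
or_join = """
SELECT o.order_id, c.customer_id
FROM orders o JOIN customers c
  ON o.email = c.email OR o.phone = c.phone
"""

# Rewrite: one equi-join per branch of the OR, combined with UNION.
# UNION removes the pair that matches on both email and phone; that is safe
# here because (order_id, customer_id) uniquely identifies a matched pair.
rewritten = """
SELECT o.order_id, c.customer_id
FROM orders o JOIN customers c ON o.email = c.email
UNION
SELECT o.order_id, c.customer_id
FROM orders o JOIN customers c ON o.phone = c.phone
"""

or_rows = sorted(conn.execute(or_join).fetchall())
rewritten_rows = sorted(conn.execute(rewritten).fetchall())
print(or_rows == rewritten_rows)
```

If the selected columns did not uniquely identify each matched pair, UNION could also drop legitimate duplicates, so the deduplication step deserves a check before applying this pattern to real tables.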
Volga - Data Processing for Real-Time AI/ML (20 minute read)
Volga is now a fully Rust-based engine for real-time AI/ML, replacing the earlier Python+Ray core to get a simpler, higher-performance runtime and tighter control over execution and state. It unifies streaming, batch, and request-time compute in one standalone system, aiming to eliminate the usual stitching across Flink, Spark, Redis, and custom services while keeping point-in-time-correct state inside the engine. The key building blocks are Apache DataFusion for SQL pipelines, Apache Arrow for execution semantics, and SlateDB for S3-backed state.
Beyond the Vector Store: Building the Full Data Layer for AI Applications (7 minute read)
Relying solely on a vector database is no longer sufficient for production AI applications, especially RAG and agentic systems. A complete AI data layer requires five integrated components: vector store, metadata & filtering, graph layer, cache, and governance & observability, making hybrid architectures (vector + graph + relational) essential for achieving better accuracy, lower cost, and true production readiness.
Future Casting the Modern Data Stack (20 minute read)
AI-driven advancements are fundamentally challenging the Modern Data Stack model, with LLMs now capable of generating high-quality SQL, automating ETL pipelines, and creating sophisticated data visualizations, drastically reducing manual query-writing and traditional BI tool usage. Data warehouse vendors face commoditization pressures, while consolidation and integration across the stack accelerate. The emerging data platform paradigm is likely an agent swarm for data management backed by a query engine powering the analytics.
Where Is the Right Place to Catch Data Volume Anomalies? (6 minute read)
Monitoring data volume anomalies at the data warehouse layer (rather than solely at the source) is critical when sources are diverse, failure modes are silent, and real-time streams lack batch boundaries. This centralized approach creates a unified detection and communication point, bridging upstream producers and downstream consumers while providing actionable data health signals. Introducing a suppression layer for known, context-specific anomalies minimizes alert fatigue without incurring technical debt.
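As a rough illustration of warehouse-level volume checks with a suppression layer, the sketch below flags a day whose row count deviates from the trailing median by more than a tolerance, unless that day is on a list of known exceptions. The function name, window, tolerance, and counts are all assumptions made for the example, not details from the article.

```python
from statistics import median

def find_volume_anomalies(daily_counts, window=7, tolerance=0.5, suppressed=frozenset()):
    """Return (day, count, baseline) for days that deviate from the trailing median.

    daily_counts: dict mapping ISO date string -> row count for one table.
    suppressed: days with a known, explained anomaly (e.g. a planned backfill).
    """
    anomalies = []
    days = sorted(daily_counts)
    for i in range(window, len(days)):
        day = days[i]
        baseline = median(daily_counts[d] for d in days[i - window:i])
        if baseline == 0:
            continue  # no meaningful baseline to compare against
        deviation = abs(daily_counts[day] - baseline) / baseline
        if deviation > tolerance and day not in suppressed:
            anomalies.append((day, daily_counts[day], baseline))
    return anomalies

# Hypothetical row counts for one warehouse table.
counts = {
    "2026-03-01": 1000, "2026-03-02": 1020, "2026-03-03": 980,
    "2026-03-04": 1010, "2026-03-05": 990, "2026-03-06": 1005,
    "2026-03-07": 995, "2026-03-08": 400,   # silent upstream drop
    "2026-03-09": 1000, "2026-03-10": 2500,  # known backfill
}

alerts = find_volume_anomalies(counts, suppressed={"2026-03-10"})
unsuppressed = find_volume_anomalies(counts)
print(alerts)
```

Keeping the suppression list as data, separate from the detection logic, is one way to silence known cases without editing thresholds each time.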
Databricks Metric Views and the Reality of the Semantic Layer (5 minute read)
Databricks is entering the semantic layer space with Metric Views, a way to centrally define business metrics directly in Unity Catalog on top of Delta tables. However, it is still quite limited compared to mature semantic layers: it supports only simple aggregations and lacks complex business logic, calculated metrics with dependencies, and advanced dimensional modeling.
Apache Iceberg Rust 0.9.0 Release (2 minute read)
Apache Iceberg has released iceberg-rust 0.9.0, introducing a trait-based storage architecture that decouples the library from specific storage backends, facilitating easier integration and extension. This version features major performance improvements for Arrow reads, expanded DataFusion support, and upgrades decimal handling to 38-digit precision with the fastnum crate.
When upserts don't update but still write: Debugging Postgres performance at scale (11 minute read)
Datadog hit a surprising Postgres performance issue while cleaning up millions of ephemeral hosts: a simple upsert to update the "last seen" timestamp doubled disk writes and quadrupled WAL syncs. The cause was that ON CONFLICT DO UPDATE always acquires a row lock and writes to the WAL, even when no data actually changes. The fix is to avoid locking on no-op upserts.
Operating Trino at Scale With Trino Gateway (9 minute read)
Expedia built Trino Gateway to solve the growing complexity of managing dozens of specialized clusters as query volume, concurrency, and workload diversity exploded. Instead of forcing users to connect to different endpoints, the gateway provides a single unified connection URL that automatically routes queries to the best cluster based on smart rules.
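The idea of a single entry point with rule-based routing can be sketched as a first-match rule list. This is an illustrative toy, not Trino Gateway's actual configuration or API; the Query fields, rule names, and cluster names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Query:
    user: str
    source: str  # hypothetical client tag, e.g. "airflow" or "bi-tool"
    sql: str

@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Query], bool]
    cluster: str

# Hypothetical rules, evaluated in order; the first match wins.
RULES = [
    Rule("etl-jobs", lambda q: q.source == "airflow", "etl-cluster"),
    Rule("dashboards", lambda q: q.source == "bi-tool", "interactive-cluster"),
]
DEFAULT_CLUSTER = "adhoc-cluster"

def route(query, rules=RULES, default=DEFAULT_CLUSTER):
    """Pick a backend cluster for a query arriving at the single gateway URL."""
    for rule in rules:
        if rule.matches(query):
            return rule.cluster
    return default

print(route(Query("svc-etl", "airflow", "INSERT INTO t SELECT 1")))
print(route(Query("alice", "cli", "SELECT 1")))
```

Clients only ever see the gateway endpoint, so clusters can be added, drained, or re-specialized by changing rules rather than client connection strings.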
State of Context Engineering in 2026 (10 minute read)
Context engineering has rapidly become central to AI agent design. Mature patterns, like progressive disclosure, sliding-window compression with summarization, precise context routing, agentic retrieval-augmented generation, and rigorous tool management, are now broadly adopted across platforms. Anthropic's Agent Skills and MCP have set standards for LLM-driven workflows, but tradeoffs around token cost, latency, and maintainability persist. Data teams should audit token consumption and implement hybrid compression and routing early to ensure agent reliability and cost efficiency.
What COVID did to our forecasting models (12 minute read)
The COVID-19 pandemic completely broke Airbnb's demand forecasting models in March 2020. The models, trained on stable historical patterns, failed to handle massive swings in booking volume, unpredictable cancellation spikes, and the collapse of the normal relationship between booking date and travel date (lead-time composition). To solve this, Airbnb decoupled forecasting into two separate parts: gross booking metrics on the booking-date axis and lead-time composition (the proportion of bookings that turn into trips on different future dates).
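The arithmetic behind the two-part decomposition can be sketched as follows: forecast gross bookings per booking week, forecast the share of bookings at each lead time, then combine them to get trips per travel week. The function name, weekly granularity, and all numbers are invented for illustration and are not Airbnb's model.

```python
def trips_by_travel_week(gross_bookings, lead_time_share):
    """Combine two separately forecast pieces into trips per travel week.

    gross_bookings: dict booking_week -> forecast bookings made that week.
    lead_time_share: dict lead (weeks ahead) -> share of bookings with that lead;
                     shares are assumed to sum to 1.
    """
    trips = {}
    for booking_week, bookings in gross_bookings.items():
        for lead, share in lead_time_share.items():
            travel_week = booking_week + lead
            trips[travel_week] = trips.get(travel_week, 0.0) + bookings * share
    return trips

# Hypothetical forecasts: bookings on the booking-date axis, plus lead-time mix.
gross_bookings = {0: 100.0, 1: 200.0}
lead_time_share = {0: 0.5, 1: 0.25, 2: 0.25}

trips = trips_by_travel_week(gross_bookings, lead_time_share)
print(trips)
```

Because the two inputs are separate, a shock to booking volume and a shift toward shorter or longer lead times can each be updated without refitting the other.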
The Death of model.fit(): What Data Scientists Actually Do in the Age of AI Agents (12 minute read)
Data science work focused on building and tuning models is rapidly becoming obsolete as AI agents and foundation models take over. Today's data scientists instead focus on four higher-level responsibilities: defining business problems and metrics, designing evaluation frameworks and guardrails, curating high-quality data, and building reliable agent systems with prompts, tools, and human oversight.
Curated deep dives, tools and trends in big data, data science and data engineering