TLDR Data 2026-06-22
Data + AI 2026 Review 🧱, Lyft’s Golden Metrics 🏅, DuckDB 1.5.4 🦆
Write-Ahead Intent Log: A Foundation for Efficient CDC at Scale (51 minute video)
DoorDash replaced fragile CDC pipelines with WAIL after Debezium hit Cassandra scale limits. WAIL logs mutation intent to Kafka and the database, then a smart consumer verifies state, applies schema rules, and publishes events, improving recovery and scaling.
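The flow described above (log the intent, write to the database, then have a consumer verify state before publishing) can be sketched roughly as follows. This is a minimal in-memory illustration, not DoorDash's implementation: the `Store` class, the version check, and the `ALLOWED_FIELDS` schema rule are all hypothetical stand-ins, with a Python list in place of Kafka and a dict in place of Cassandra.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    key: str
    value: dict
    version: int


@dataclass
class Store:
    rows: dict = field(default_factory=dict)        # stands in for the database
    intent_log: list = field(default_factory=list)  # stands in for a Kafka topic


def write(store, key, value, version, fail_db=False):
    """Record the mutation intent first, then attempt the database write."""
    store.intent_log.append(Intent(key, value, version))
    if not fail_db:  # fail_db simulates a write that never reached the database
        store.rows[key] = {"value": value, "version": version}


# Hypothetical schema rule: only these fields may appear in published events.
ALLOWED_FIELDS = {"status", "total"}


def consume(store):
    """Verify each intent against database state, apply the schema rule, publish."""
    events = []
    for intent in store.intent_log:
        row = store.rows.get(intent.key)
        # Assumed verification step: skip intents the database never applied.
        if row is None or row["version"] < intent.version:
            continue
        payload = {k: v for k, v in intent.value.items() if k in ALLOWED_FIELDS}
        events.append({"key": intent.key, "version": intent.version, "payload": payload})
    return events


store = Store()
write(store, "order-1", {"status": "placed", "total": 10, "debug": "x"}, version=1)
write(store, "order-2", {"status": "placed"}, version=1, fail_db=True)
events = consume(store)
print(events)
```

Because the consumer checks database state rather than trusting the log alone, the failed write for `order-2` produces no event, and the `debug` field is dropped by the schema rule.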
AI Agents to Make Sense of Data at OpenAI (45 minute video)
OpenAI's Kepler has moved beyond text-to-SQL into a context-rich analyst over 600+ PB of data. Daily Codex jobs crawl code to infer grain, lineage, freshness, usage, and hidden semantics. Scoped memories capture corrections at user, team, and global levels, while AST-normalized LLM grading checks SQL and result equivalence. Next: memory pruning, fine-tuning, and dashboard self-validation.
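One way to picture scoped memories is a lookup that checks the user level, then the team level, then the global level. The sketch below assumes that the most specific scope wins, which the summary does not state; the `resolve` function, the key layout, and the sample notes are hypothetical and not taken from Kepler.

```python
def resolve(memories, user, team, topic):
    """Return the most specific correction for a topic, or None if there is none."""
    # Assumed precedence: user overrides team, team overrides global.
    for scope, owner in (("user", user), ("team", team), ("global", None)):
        note = memories.get((scope, owner, topic))
        if note is not None:
            return note
    return None


memories = {
    ("global", None, "revenue"): "revenue excludes refunds",
    ("team", "growth", "revenue"): "growth team reports revenue in USD only",
    ("user", "ana", "active_users"): "ana counts a user as active after one session",
}

print(resolve(memories, "ana", "growth", "revenue"))
print(resolve(memories, "ana", "finance", "revenue"))
```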
Metric Semantic Layer: How Lyft Governs and Scales Key Data Definitions (7 minute read)
Lyft built an internal Metric Semantic Layer to solve metric definition drift and ensure consistent business logic across teams. It enforces governance through “Golden Metrics,” dual ownership (Business + Operational owners), versioned updates, and access via Python APIs, self-service UI, Amundsen catalog, and AI agents, providing a single source of truth that propagates changes automatically.
ClickHouse Ingestion at Scale: An Open-Source Zepto Engineering Story (8 minute read)
Zepto improved high-scale ClickHouse ingestion by optimizing the open-source Kafka Connect connector. The team rewrote key internals, lifted throughput by 45%, removed severe GC pauses, added smarter batching, and contributed two major fixes upstream.
DuckDB's agent moment (55 minute podcast)
MotherDuck argues DuckDB's local-first, single-node design fits agents that spin up isolated environments, branch workloads, and run analytical queries. It reports roughly 3 ms median latency and claims a $2.40/hour instance beats a $64/hour Snowflake 2XL by about 5x on ClickBench. The vision is agent swarms handling profiling, quality evals, context, anomalies, and lineage.
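To see what the pricing claim could imply, here is the arithmetic under two loud assumptions: that "about 5x" refers to total benchmark runtime, and that both services bill linearly by the hour. Neither is confirmed by the summary, so treat the result as an illustration rather than a reported figure.

```python
duckdb_rate = 2.40      # USD per hour, from the summary
snowflake_rate = 64.00  # USD per hour, from the summary
speedup = 5.0           # assumed to mean 5x faster runtime on ClickBench

# Hourly price gap alone.
price_ratio = snowflake_rate / duckdb_rate

# Cost for the same workload: cheaper per hour and fewer hours needed.
cost_ratio = price_ratio * speedup

print(f"{price_ratio:.1f}x cheaper per hour, {cost_ratio:.0f}x cheaper per workload")
```

If the 5x figure instead already describes price-performance, the per-workload multiplier here would not apply.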
Review of Databricks Data + AI Summit 2026 (14 minute read)
Databricks' big 2026 announcements are about simplifying data architecture: Lakehouse//RT aims to serve real-time apps and dashboards directly from the lakehouse, while LTAP tries to unify transactional and analytical workloads on one governed copy of data instead of relying on separate databases, CDC, ETL, and serving layers.
7 Crucial Barriers between Data Teams and Self-Healing Data Architecture (9 minute read)
Genie Ops points toward self-healing pipelines, but context, governance, and interoperability remain hard. Teams need clearer credential management, event orchestration, agent control, and “git for data” patterns like cloning, rollback, and sandboxed edits. Without standards, it gets messy.
Announcing DuckDB 1.5.4 (3 minute read)
DuckDB 1.5.4 is a patch release focused on bug fixes, security hardening, and performance improvements across areas like VARIANT handling, MERGE INTO, JSON, Parquet, Arrow, gzip, and CLI behaviour.
AWS enters the context layer race with a graph that learns from agents, not manual curation (3 minute read)
AWS announced a context stack for AI agents: AWS Context, S3 Annotations, and skill assets in Glue Data Catalog. AWS Context builds and improves a knowledge graph from enterprise data, rules, and domain knowledge, with IAM and Lake Formation access control. Metadata sits in Iceberg on S3 Tables and is exposed through Athena, Redshift, Spark, and MCP tools.
Data-Juicer: The Data Operating System for the Foundation Model Era (Tool)
A Ray-native framework delivers 200+ composable data-curation operators for cleaning, deduplicating, synthesizing, and analyzing text, image, audio, video, and multimodal training data. Teams define reusable pipelines as YAML recipes, combine operators into custom workflows, and scale execution from local machines to large distributed clusters.
Ten years of ClickHouse in open source (17 minute read)
ClickHouse is marking ten years since going open source, reflecting on how it grew from internal web analytics experiments into a major open source analytical database built from scratch for real-time, high-volume workloads and open contribution.
Data quality traffic lights (13 minute read)
Nordnet added a real-time Data Quality Health Badge in Looker, showing green, yellow, or red dashboard trust signals. It catches dbt failures, silent crashes, freshness issues, and volume anomalies, consolidates repeat alerts, and uses dbt plus Looker lineage to show blast radius.
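A badge like this boils down to mapping a handful of signals to a color. The sketch below is one plausible mapping, not Nordnet's actual rules: the `Signals` fields, the thresholds, and the choice to treat failures as red and freshness or volume issues as yellow are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    dbt_failed: bool = False          # a dbt test or model run failed
    job_crashed: bool = False         # the pipeline died without reporting
    hours_since_refresh: float = 0.0  # data freshness
    volume_deviation: float = 0.0     # fraction away from the expected row count


def badge(s, max_age_hours=24.0, max_deviation=0.3):
    """Map quality signals to a traffic-light color (thresholds are hypothetical)."""
    if s.dbt_failed or s.job_crashed:
        return "red"
    if s.hours_since_refresh > max_age_hours or abs(s.volume_deviation) > max_deviation:
        return "yellow"
    return "green"


print(badge(Signals()))
print(badge(Signals(hours_since_refresh=30.0)))
print(badge(Signals(dbt_failed=True, volume_deviation=0.5)))
```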
Curated deep dives, tools and trends in big data, data science and data engineering 📊