Snowflake vs Databricks’ Reyden 🥊, Coding Agent Discipline 🛠️, Apache Flink 2.3.0 🌊

How we built SmithDB's inverted index for full-text search (11 minute read)

SmithDB builds inverted indexes with efficient JSON parsing, tokenization, string interning, and radix sorting; interning lifted construction speed by ~2.2x. Streaming compaction bounds memory regardless of index size, while aligned chunks and request coalescing reduce object-storage GETs. Queries merge local-SSD indexes with object-storage segments for sub-second freshness.
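For readers who want to see how string interning and radix sorting can fit together when building an inverted index, here is a minimal Python sketch. It is not SmithDB's code: the names `Interner`, `radix_sort`, and `build_index`, the `message` field, the regex tokenizer, and the packing of term ID and document ID into one 64-bit key are all assumptions made for illustration.

```python
import json
import re
from collections import defaultdict

# Hypothetical tokenizer: lowercase alphanumeric runs only.
TOKEN_RE = re.compile(r"[a-z0-9]+")


class Interner:
    """Maps each distinct token string to a small integer ID (string interning)."""

    def __init__(self):
        self.ids = {}       # token -> term ID
        self.strings = []   # term ID -> token

    def intern(self, token):
        term_id = self.ids.get(token)
        if term_id is None:
            term_id = len(self.strings)
            self.ids[token] = term_id
            self.strings.append(token)
        return term_id


def radix_sort(keys, key_bits=64, radix_bits=8):
    """Stable LSD radix sort for non-negative integers below 2**key_bits."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        buckets = [[] for _ in range(1 << radix_bits)]
        for key in keys:
            buckets[(key >> shift) & mask].append(key)
        keys = [key for bucket in buckets for key in bucket]
    return keys


def build_index(json_lines, field="message"):
    """Builds {token: sorted list of doc IDs} from JSON-lines input.

    Assumes each line is a JSON object whose `field` value is a string,
    and that term IDs and doc IDs each fit in 32 bits.
    """
    interner = Interner()
    keys = []
    for doc_id, line in enumerate(json_lines):
        text = json.loads(line).get(field, "")
        # Deduplicate tokens per document so each posting appears once.
        for token in set(TOKEN_RE.findall(text.lower())):
            # Pack (term ID, doc ID) into one integer so a single integer
            # sort groups postings by term, then by document.
            keys.append((interner.intern(token) << 32) | doc_id)

    postings = defaultdict(list)
    for key in radix_sort(keys):
        postings[interner.strings[key >> 32]].append(key & 0xFFFFFFFF)
    return dict(postings)


docs = [
    '{"message": "disk full on node A"}',
    '{"message": "Node B disk OK"}',
]
index = build_index(docs)
```

Interning means the sort compares fixed-width integers rather than variable-length strings, which is one plausible reason it speeds up construction. The sketch does not cover streaming compaction, chunk alignment, or request coalescing.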

TLDR Data 2026-06-29

Deep Dives

How we built SmithDB's inverted index for full-text search (11 minute read)

12TB of AI Coding Agent Logs (17 minute video)

Automated Schema Evolution in Pinterest's Next-Generation DB Ingestion Framework (11 minute read)

Turning Scattered Data Into Queryable Segments at Scale: How Razorpay Built Its Customer Data Platform (11 minute read)

Opinions & Advice

Why Real Workload Performance is the Metric that Matters (7 minute read)

Building My Own Self-Hosted dbt Cloud (6 minute read)

Launches & Tools

AI lacks real-world context - and it's costing business trillions annually (Sponsor)

Apache Flink 2.3.0 Release Announcement (8 minute read)

Kafka Share Groups - Pathological Fetch Waits with Record_limit (13 minute read)

Hardwood 1.0: A Fast, Lightweight Apache Parquet Reader for the JVM (9 minute read)

Miscellaneous

14x faster embeddings: how we rebuilt the ONNX path in Manticore (9 minute read)

We Ran 250 AI Agent Evals to Find Out if Skills Beat Docs. The Answer Is More Complicated Than We Expected (6 minute read)

How we used DSPy to turn AI evaluations into better responses in Dash chat (5 minute read)

Quick Links

Gemma Interactions View (5 minute read)

Host- and Domain-Level Web Graphs April, May, and June 2026 (3 minute read)

Curated deep dives, tools and trends in big data, data science and data engineering 📊