TLDR Data 2026-06-29
Snowflake vs Databricks’ Reyden 🥊, Coding Agent Discipline 🛠️, Apache Flink 2.3.0 🌊
How we built SmithDB's inverted index for full-text search (11 minute read)
SmithDB builds inverted indexes with efficient JSON parsing, tokenization, string interning, and radix sorting; interning lifted construction speed by ~2.2x. Streaming compaction bounds memory regardless of index size, while aligned chunks and request coalescing reduce object-storage GETs. Queries merge local-SSD indexes with object-storage segments for sub-second freshness.
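As a rough illustration of the tokenization and string-interning steps mentioned above, here is a minimal sketch. It is not SmithDB's implementation: the document shape (`id`, `body`), the regex tokenizer, and the function names are assumptions made for the example, and Python's built-in sort stands in for radix sorting.

```python
import json
import re
from collections import defaultdict

# Assumed tokenizer: lowercase alphanumeric runs. Real systems use richer rules.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_index(json_lines):
    """Build a term -> sorted doc-id list from JSON documents.

    Hypothetical document shape: {"id": int, "body": str}.
    """
    term_ids = {}                # interning table: token string -> small int
    terms = []                   # reverse table: int -> token string
    postings = defaultdict(set)  # term id -> set of doc ids

    for line in json_lines:
        doc = json.loads(line)
        for token in tokenize(doc["body"]):
            # Intern the token: each distinct string is stored once and
            # referred to by an integer id from then on.
            tid = term_ids.setdefault(token, len(terms))
            if tid == len(terms):
                terms.append(token)
            postings[tid].add(doc["id"])

    # Built-in sort used here in place of a radix sort.
    return {terms[tid]: sorted(ids) for tid, ids in postings.items()}


docs = [
    '{"id": 1, "body": "Fast full-text search"}',
    '{"id": 2, "body": "fast radix sort"}',
]
index = build_index(docs)
```

Working with integer term ids instead of repeated strings is the general idea behind interning; the speedup quoted in the article is specific to its own implementation.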
12TB of AI Coding Agent Logs (17 minute video)
AI coding is shifting from token maxing to token efficiency as teams move from subscriptions to per-token billing and costs become harder to control. Better workflows rely on careful upfront planning, right-sized agent sessions, cleaner context, API-first tooling, strong CI, and focused human review.
Automated Schema Evolution in Pinterest's Next-Generation DB Ingestion Framework (11 minute read)
Pinterest built schema evolution for CDC across Kafka, Flink, Spark, and Iceberg, treating schema as a contract. Source schemas and sink mappings generate Flink/Spark/Iceberg artifacts, while push- and pull-based checks detect drift. Changes roll out with PR auditability, SLA-based recovery, and backfill fallbacks.
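The drift checks can be pictured as a comparison between the schema a pipeline expects and the schema the source currently reports. The sketch below assumes schemas are plain column-to-type mappings and uses a hypothetical `detect_drift` helper; it does not reflect Pinterest's actual tooling.

```python
def detect_drift(expected, actual):
    """Compare two {column: type} mappings and report the differences."""
    expected_cols = set(expected)
    actual_cols = set(actual)
    return {
        # Columns present at the source but unknown to the pipeline.
        "added": sorted(actual_cols - expected_cols),
        # Columns the pipeline expects that the source no longer has.
        "removed": sorted(expected_cols - actual_cols),
        # Columns present on both sides whose type differs.
        "changed": sorted(
            col for col in expected_cols & actual_cols
            if expected[col] != actual[col]
        ),
    }


# Hypothetical schemas for illustration only.
expected = {"id": "bigint", "email": "string", "age": "int"}
actual = {"id": "bigint", "email": "string", "age": "bigint", "country": "string"}
drift = detect_drift(expected, actual)
```

A push-based check would run a comparison like this when the source announces a change, while a pull-based check would run it on a schedule.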
Turning Scattered Data Into Queryable Segments at Scale: How Razorpay Built Its Customer Data Platform (11 minute read)
Razorpay built an in-house Customer Data Platform that turns scattered transaction data across 500M+ user profiles into real-time, queryable audience segments. Airflow DAGs and Spark compute segments daily with reuse and deduplication, Temporal workflows handle reliable DynamoDB ingestion with zero-downtime versioning, and hashed lookups preserve privacy.
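One way to read "hashed lookups" is that the segment store is keyed by a hash of the customer identifier rather than the identifier itself. The sketch below assumes a keyed HMAC-SHA256 and an in-memory dict as the store; the summary does not say which hash Razorpay uses, and `hash_key` and `lookup_segments` are hypothetical names.

```python
import hashlib
import hmac


def hash_key(identifier, secret):
    """Return a keyed hash of a normalized identifier (assumed: HMAC-SHA256)."""
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()


def lookup_segments(store, identifier, secret):
    """Look up segments by hashed key; the raw identifier is never stored."""
    return store.get(hash_key(identifier, secret), set())


# Illustrative secret and data; a real deployment would load the key securely.
secret = b"demo-secret"
store = {hash_key("User@Example.com", secret): {"high_value"}}
segments = lookup_segments(store, " user@example.com ", secret)
```

Normalizing before hashing matters here, since otherwise the same customer could map to several different keys.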
Why Real Workload Performance is the Metric that Matters (7 minute read)
Real workload performance matters more than headline benchmarks because production systems need to handle real data, concurrency, latency, scale, and cost. Performance claims should be judged by whether the workload matches yours, the setup is production-ready, results hold as data grows, and the product is actually available.
Building My Own Self-Hosted dbt Cloud (6 minute read)
A self-hosted dbt Cloud-style app can deliver much of the developer experience by combining dbt Core with a React/FastAPI interface and Prefect for orchestration. The biggest lesson is to use APIs, not CLI scraping, for reliable job management, logs, deployments, and real-time run status.
Apache Flink 2.3.0 Release Announcement (8 minute read)
Flink 2.3 moves toward a declarative streaming data platform. Materialized tables can evolve through DDL and query changes while avoiding unnecessary historical reprocessing in many common cases. SQL adds changelog conversion, explicit upsert conflict handling, and native S3 support without Hadoop dependencies.
Kafka Share Groups - Pathological Fetch Waits with Record_limit (13 minute read)
A notable performance pitfall in Kafka Share Groups arises when using record_limit with fewer consumers than partitions, especially under partition skew. This leads to pathological fetch waits, which can drastically slow consumption during backlog drains or skewed workloads. The simplest mitigation is to use at least as many consumers as partitions when running with record_limit.
Hardwood 1.0: A Fast, Lightweight Apache Parquet Reader for the JVM (9 minute read)
Hardwood 1.0 is a production-ready, JVM-native Parquet reader for Java 21+ that removes mandatory dependencies and parallelizes page decoding across CPU cores by default. It covers Parquet physical/logical types, projections, predicate push-down, local and object-store files, with row and batch column APIs. Benchmarks show 16.5M rows/sec and ~17-18x selective push-down speedups.
14x faster embeddings: how we rebuilt the ONNX path in Manticore (9 minute read)
Manticore rewrote its embedding pipeline on ONNX Runtime, slashing CPU waste and lifting throughput up to 14x for low-latency vector search. The design shares one thread-safe ONNX session, disables intra-op spinning, and processes documents individually to avoid lock contention and variable-length padding overhead.
We Ran 250 AI Agent Evals to Find Out if Skills Beat Docs. The Answer Is More Complicated Than We Expected (6 minute read)
Wix's 250-run evaluation found agent-optimized docs improved CLI task completion from 67% to 87%, cut token use 35%, and beat skills-only runs when skills were stale or misaligned. For API tasks, both hit 80% completion, but docs ran 31% faster while skills used 29% fewer tokens. Use optimized docs as the foundation, with skills as an evaluated caching layer.
How we used DSPy to turn AI evaluations into better responses in Dash chat (5 minute read)
Dropbox used DSPy to turn AI evaluations into concrete Dash Chat improvements, combining LLM-as-judge evals, human-labeled examples, offline replay, and statistical validation. The result was fewer incomplete answers, better coverage of user intent, and lower token use without compromising answer quality.
Curated deep dives, tools and trends in big data, data science and data engineering 📊