TLDR Data 2025-08-21

Semantic Layers Matter 🗂️, Grab Decentralizes Data Ownership ✅, Databricks Goes Real-Time ⚡

📱

Deep Dives

Data mesh at Grab part I: Building trust through certification (7 minute read)

Grab rearchitected its enterprise data infrastructure by implementing a data mesh, with 75% of queries now running on certified assets and significantly less data redundancy. Key enablers of decentralized data ownership included formalizing data contracts, automating incident tracking, and emphasizing data certification. This shift has accelerated cross-domain data reuse, improved reliability, and streamlined data governance.

Building a High Recall Vector Database Serving 1 billion Embeddings From a Single Machine (11 minute read)

CoreNN enhances DiskANN and FreshDiskANN for higher recall with fewer traversals, using RocksDB for persistent storage, arbitrary keys, and seamless scaling. It has two stages, in-memory for speed/accuracy and on-disk with 32x product quantization compression, with support for concurrent operations via backedge deltas and synchronous programming.
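To make the 32x figure concrete, here is a minimal product quantization sketch. It assumes float32 vectors and one 1-byte code per 8 dimensions (32 bytes in, 1 byte out); those parameters, the function names, and the random-sample codebooks (a stand-in for k-means training) are illustrative assumptions, not CoreNN's actual implementation.

```python
import random

def train_codebooks(vectors, sub_dim, n_centroids, seed=0):
    """Build one codebook per subspace by sampling training subvectors.
    Sampling is a simple stand-in for k-means (assumption)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    codebooks = []
    for start in range(0, dim, sub_dim):
        subs = [tuple(v[start:start + sub_dim]) for v in vectors]
        codebooks.append(rng.sample(subs, n_centroids))
    return codebooks

def encode(vector, codebooks, sub_dim):
    """Replace each subvector with the index of its nearest centroid."""
    codes = []
    for i, book in enumerate(codebooks):
        sub = vector[i * sub_dim:(i + 1) * sub_dim]
        nearest = min(
            range(len(book)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(sub, book[c])),
        )
        codes.append(nearest)
    return bytes(codes)  # one byte per subspace, so n_centroids must be <= 256

def compression_ratio(dim, sub_dim, bytes_per_float=4):
    """Raw vector size divided by encoded size (one byte per subspace)."""
    return (dim * bytes_per_float) / (dim // sub_dim)

# Hypothetical toy data: 300 random 16-dimensional training vectors.
rng = random.Random(42)
training = [[rng.random() for _ in range(16)] for _ in range(300)]
codebooks = train_codebooks(training, sub_dim=8, n_centroids=256)
code = encode(training[0], codebooks, sub_dim=8)
ratio = compression_ratio(dim=16, sub_dim=8)
```

Under these assumptions a 16-dimensional vector shrinks from 64 bytes to 2, and the same ratio holds at any dimension divisible by 8.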

Inside ClickHouse Full-text Search: Fast, Native, and Columnar (25 minute read)

ClickHouse has overhauled its full-text search engine to be fully native, leaner, and blazing fast by leveraging inverted indexes, FST-based dictionaries, roaring bitmaps, smarter tokenization, and direct index-driven row filtering. Queries are up to 10x faster since the index filters directly at row level instead of granule level, returning only matching IDs without row-by-row text scans.
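The core idea of index-driven row filtering can be sketched with a toy inverted index. This is not ClickHouse's implementation: a plain dict stands in for the FST-based dictionary, Python sets stand in for roaring bitmaps, and the tokenizer and sample rows are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(rows):
    """Map each token to the set of row IDs that contain it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        for token in tokenize(text):
            index[token].add(row_id)
    return index

def search_all(index, query):
    """Return sorted row IDs containing every query token,
    found by intersecting posting sets rather than scanning row text."""
    tokens = tokenize(query)
    if not tokens:
        return []
    postings = [index.get(token, set()) for token in tokens]
    return sorted(set.intersection(*postings))

# Hypothetical log lines used as sample rows.
rows = [
    "Connection timeout while reading from replica",
    "Query finished in 12 ms",
    "Timeout while waiting for lock",
]
index = build_index(rows)
matches = search_all(index, "timeout while")
```

The query never touches the row text: it only intersects the per-token ID sets, which is the property the summary credits for the speedup.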

Kafka to Iceberg - Exploring the Options (13 minute read)

For moving data from Apache Kafka to Apache Iceberg, Flink SQL offers advanced stateful and stateless processing with extensive flexibility, but requires manual schema management and custom table maintenance. Kafka Connect provides robust integration and schema evolution with a rich connector ecosystem, excelling as a "dumb pipe" with stateless Single Message Transforms, but lacks native support for UPSERT/overwrite operations. Confluent Tableflow, a managed service ideal for SaaS users prioritizing ease and operational efficiency, simplifies setup, provides built-in table maintenance, supports automatic schema evolution, and integrates with managed Flink for preprocessing.

🚀

Opinions & Advice

LLM Evaluation: Practical Tips at Booking.com (11 minute read)

Booking.com shares practical insights on evaluating Large Language Models (LLMs) in production environments, primarily through the "LLM-as-a-judge" framework. Key steps include creating a "golden dataset" with expert-labeled high-quality data for benchmarking, and developing the judge-LLM through prompting or fine-tuning to match golden labels accurately.
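One way to check how well a judge-LLM matches the golden labels is to compute raw agreement alongside Cohen's kappa, which corrects for agreement expected by chance. The choice of kappa, the label values, and the sample data below are assumptions for illustration, not metrics or data taken from the Booking.com post.

```python
from collections import Counter

def accuracy(golden, judged):
    """Fraction of items where the judge label equals the golden label."""
    return sum(g == j for g, j in zip(golden, judged)) / len(golden)

def cohens_kappa(golden, judged):
    """Agreement between judge and golden labels, corrected for chance."""
    n = len(golden)
    observed = accuracy(golden, judged)
    gold_counts = Counter(golden)
    judge_counts = Counter(judged)
    # Chance agreement: both sources pick the same label independently.
    expected = sum(
        gold_counts[label] * judge_counts[label] for label in gold_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # only one label in use; agreement is trivially perfect
    return (observed - expected) / (1 - expected)

# Hypothetical expert labels and judge-LLM verdicts for eight responses.
golden = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
judged = ["good", "good", "bad", "bad", "bad", "bad", "good", "good"]

agreement = accuracy(golden, judged)
kappa = cohens_kappa(golden, judged)
```

Here the judge agrees with the experts on 6 of 8 items (0.75), but kappa is only 0.5 because half of that agreement would be expected by chance on a balanced dataset.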

No More Excuses for Stream/Table Duality (2 minute read)

Aiven has released the first free, open-source implementation of Apache Iceberg support for Apache Kafka topics, enabling direct log-segment conversion to Parquet and seamless integration with object storage without data copying. While not fully production-ready due to schema evolution limitations, this release eliminates previous reliance on proprietary solutions from Confluent and Redpanda and significantly advances open Lakehouse architectures for Kafka-based streaming pipelines.

5 Things in Data Engineering That Still Hold True After 10 Years (9 minute read)

While new tools and AI have influenced data engineering, the fundamentals remain: good data modeling still requires deliberate choices, data quality issues persist (“garbage in, garbage out”), and dashboards and queries often slow down without proper indexing or aggregation. Despite growth in streaming, batch processing is still the backbone of most platforms, and above all, business alignment matters more than tools.

💻

Launches & Tools

Lance (GitHub Repo)

Lance is a modern columnar data format, implemented in Rust and optimized for ML and LLM workloads. It delivers lightning-fast random access (up to 100 times faster than Parquet) alongside features like vector search, zero-cost schema evolution, versioning, and rich secondary indices. Lance is tightly integrated with PyArrow-compatible tools such as Pandas, DuckDB, Polars, and PyTorch, enabling seamless adoption within existing ecosystems.

Presidio (GitHub Repo)

Presidio is an open-source SDK for detecting and anonymizing PII in text and images. It offers customizable modules using NER, regex, and external models, with support for Python, PySpark, Docker, and Kubernetes to enable flexible privacy-preserving workflows.
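The regex side of this kind of detection can be sketched with the standard library alone. The code below does not use Presidio's API: the patterns, labels, and function names are simplified assumptions meant only to show the detect-then-replace flow.

```python
import re

# Deliberately simple illustrative patterns; real PII recognizers are stricter.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def detect(text):
    """Return (start, end, label) spans for every pattern match."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((match.start(), match.end(), label))
    return sorted(findings)

def anonymize(text):
    """Replace each detected span with a <LABEL> placeholder.
    Spans are replaced right to left so earlier offsets stay valid."""
    for start, end, label in sorted(detect(text), reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text

# Hypothetical input containing an email address and a phone number.
sample = "Contact jane.doe@example.com or +65 6123 4567 for details."
redacted = anonymize(sample)
```

Separating detection from replacement mirrors the analyze-then-anonymize split the summary describes, and makes it easy to swap in an NER model for the detection step.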

Introducing Real-Time Mode in Apache Spark™ Structured Streaming (5 minute read)

Spark Structured Streaming now supports real-time mode, enabling millisecond-level processing for low-latency applications like fraud detection and live personalization. It requires only a simple config change, and is available in Public Preview on Databricks with popular source and sink support, including Kafka, Kinesis, and forEach for writing to external systems.

🎁

Miscellaneous

The Pragmatic Engineer 2025 Survey: What's in your tech stack? (15 minute read)

PostgreSQL is by far the most-used database (used by one-third of respondents), followed by MySQL, MongoDB, Redis, and 100+ alternatives. Kubernetes, Docker, and Terraform are prevalent for backend infrastructure, while Slack, Confluence, and Figma lead for communication and design. Notably, OpenSearch now rivals Elasticsearch due to AWS' backing, but open-source forks of Redis and Terraform see minimal traction. JIRA, VS Code, and AWS are the top tools mentioned overall.

Why Semantic Layers Matter — And How to Build One with DuckDB (22 minute read)

This post presents an overview of the importance of semantic layers in data engineering, highlighting their role in providing a unified definition of business metrics across various tools, which simplifies data governance and reduces duplication. It offers a practical guide to building a basic semantic layer using YAML and Python with DuckDB, targeting scenarios where a semantic layer is beneficial, such as when multiple analytics consumers and complex business logic are involved.
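A semantic layer of this kind boils down to declaring metrics once and compiling them into SQL on demand. The sketch below is an assumption-laden illustration, not the post's code: a Python dict stands in for the YAML file, the stdlib sqlite3 module stands in for DuckDB, and the table, metric, and function names are hypothetical.

```python
import sqlite3

# Hypothetical model definition; in the post's setup this would live in YAML.
SEMANTIC_MODEL = {
    "table": "orders",
    "dimensions": {"country": "country"},
    "metrics": {
        "revenue": "SUM(amount)",
        "order_count": "COUNT(*)",
    },
}

def compile_query(model, metrics, dimensions):
    """Turn requested metric and dimension names into one SQL statement."""
    dim_exprs = [model["dimensions"][d] for d in dimensions]
    select_parts = [f"{expr} AS {name}" for expr, name in zip(dim_exprs, dimensions)]
    select_parts += [f"{model['metrics'][m]} AS {m}" for m in metrics]
    sql = "SELECT " + ", ".join(select_parts) + " FROM " + model["table"]
    if dim_exprs:
        group_cols = ", ".join(dim_exprs)
        sql += f" GROUP BY {group_cols} ORDER BY {group_cols}"
    return sql

# In-memory SQLite database with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("SG", 10.0), ("SG", 5.0), ("ID", 7.5)],
)

sql = compile_query(SEMANTIC_MODEL, ["revenue", "order_count"], ["country"])
rows = conn.execute(sql).fetchall()
```

Every consumer that asks for `revenue` gets the same `SUM(amount)` definition, which is the unified-metric benefit the summary describes.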

⚡️

Quick Links

Lessons Learned From Building a Sync-engine and Reactivity System with SQLite (5 minute read)

The author built a sync engine and reactive system for a local-first encrypted notes app, switching from PostgreSQL/Electric to SQLite for simplicity and stability, with lightweight sync via JSON requests and frequent polling.

Spotify Data Tech Stack (4 minute read)

Spotify handles 1.4T+ daily events on GCP with 38K+ pipelines, using BigQuery, Dataflow, and Flyte to power batch and streaming analytics for 670M users and personalized experiences.

Curated deep dives, tools and trends in big data, data science and data engineering 📊
