TLDR Data 2025-06-26
New LLM Data Stack π₯, Why Spark Feels Slow π, MCP in Minutes βΒ
Why Apache Spark is Often Considered as Slow? (21 minute read)
OSS Vanilla Spark is a versatile, hybrid distributed query engine that supports diverse workloads, including OLAP on data warehouses and semi-structured data processing. However, its unified design makes it slower than specialized vectorized engines like Trino or Snowflake for OLAP queries on columnar data. Through the Catalyst Extensions API, solutions like Databricks Photon, Apache Gluten, and Apache Datafusion Comet transform Spark into a pure vectorized engine by replacing code generation with vectorized execution, leveraging frameworks like VeloxDB, ClickHouse, or Datafusion for improved performance.
Hands-on with Apache Iceberg (71 minute video)
Apache Iceberg is a metadata-driven table format that lets data teams write once to object storage and query the same data from engines such as Trino, Polars, DuckDB, or Spark while supporting hidden partitioning, snapshot time travel, schema evolution, and branch-style rollbacks. A UK housing demo illustrates these features in practice by adding columns without rewrites, switching partitions, cleaning small files, tagging snapshots, and streaming Kafka events straight into Iceberg, delivering zero-copy analytics and simpler pipeline maintenance.
Scaling Pinterest ML Infrastructure with Ray: From Training to End-to-End ML Pipelines (8 minute read)
Pinterest achieved a 10x time reduction in ML iteration by integrating Ray across the entire ML infrastructure stack (from feature engineering and sampling to labeling) beyond training/batch inference workflows. The company optimized data ingestion with Iceberg write mechanisms and achieved 90% of roofline throughput in the homefeed ranking model training pipeline.
How Skroutz Handles Real-time Schema Evolution in Amazon Redshift with Debezium (7 minute read)
Skroutz streamlined real-time schema evolution for its Redshift data warehouse using Debezium CDC with Kafka, capturing schema changes from MariaDB and MongoDB without blocking development or risking data loss. Schema additions are logged in a SUPER-typed JSON column and surfaced for manual intervention via automated alerts, minimizing coordination requirements.
The Hidden Cost of Over-instrumentation: Why More Tracking Can Hurt Product Teams (4 minute read)
The pitfalls of over-instrumentation are where excessive tracking of user actions leads to metric sprawl, wasting engineering effort, and complicating data validation. Instead, teams should treat instrumentation as a product by defining clear purposes for tracked metrics, creating documentation for key events, conducting quarterly cleanup of unused events, and involving analysts early in the design phase to ensure actionable data.
Data Integrity vs Data Security: Why You Need Both (3 minute read)
Data integrity ensures data remains accurate, consistent, and trustworthy throughout its lifecycle, while data security focuses on protecting data from unauthorized access, breaches, or leaks. Data integrity employs techniques like referential integrity checks, data validation, and observability tools to prevent corruption or errors, whereas data security uses encryption, access controls, and threat detection to safeguard data.
Find out what all the ducking fuss is about with the free DuckDB ebook (Sponsor)
Discover why DuckDB is the go-to analytics database for startups that need to get to market fast. The DuckDB in Action ebook provides practical examples and expert tips on running efficient queries locally and seamlessly scaling to the cloud with MotherDuck.
Get your free copy and
try it out with a free trial of MotherDuck.
Schema In, Data Out: A Smarter Way to Mock (4 minute read)
MockingJar is an open-source tool designed to generate realistic, schema-valid mock data instantlyβcrucial for speeding prototyping, testing, and data flow validation across software stacks. It features a visual schema builder, robust validation, and an AI-powered generator (Claude Sonnet 4). MockingJar supports complex structures, embedded schemas, and custom constraints. It is usable as both a web app and a TypeScript package.
Introducing Northguard and Xinfra: Scalable Log Storage at LinkedIn (12 minute read)
LinkedIn replaced its sprawling Kafka estate with Northguard, a sharded log store that stripes data and metadata for automatic balance, stronger consistency, and massive scale. Xinfra provides a virtual pub-sub layer so apps keep the same API during seamless, federated migrations across clusters.
Langfuse and ClickHouse: A New Data Stack for Modern LLM Applications (8 minute read)
Langfuse transitioned from a Postgres-based architecture to ClickHouse with a Redis for cache, S3 for storing large payloads, and an async event processor to manage high-ingestion workloads. It enables efficient real-time analytics and seamless migration for over 1,000 self-hosted users without downtime by leveraging ClickHouse's columnar architecture.
New With Confluent Platform 8.0: Stream Securely, Monitor Easily, and Scale Endlessly (9 minute read)
Confluent Platform 8.0, built on Apache Kafka 4.0, is now available. It eliminates ZooKeeper by introducing KRaft for simplified architecture, instant failover, and support for clusters with millions of partitions. The release features a revamped, Prometheus-powered Control Center with observability for Kafka and Flink, delivers client-side field-level encryption for sensitive data, and introduces passwordless authentication.
Data federation: Understanding What It Is and How It Works (8 minute read)
Data federation enables businesses to access and integrate data from multiple sources in real time without physically moving or duplicating it, thereby maintaining data integrity and reducing storage costs. It uses a virtual database to provide a unified view of disparate data sources, allowing seamless querying and analysis while keeping data in its original location.
Google Donates the Agent2Agent Protocol to the Linux Foundation (3 minute read)
Google's Agent2Agent (A2A) protocol, designed to streamline interoperability between AI agents across different frameworks, has been donated to the Linux Foundation. The initiative, joined by AWS, Cisco, Salesforce, SAP, and ServiceNow, emphasizes open standards, vendor neutrality, and community-driven tooling (SDKs and NPM packages) to accelerate cross-agent collaboration at both enterprise and developer scales.
Plane Tracking with Apache Flink (GitHub Repo)
This demo project uses Apache Flink's SQL and complex event processing to analyze real-time ADS-B aviation feeds for critical flight moments such as missed landing approaches and paired runway landings. The project provides Docker-Compose setups for Kafka, Flink, Connect, and Schema Registry, Python scripts to ingest live data from ADSB.fi, and FlinkSQL definitions to detect and enrich flight events.
FastMCP 2.0 (GitHub Repo)
FastMCP 2.0 is a Python framework that lets you stand up Model Context Protocol servers or clients with a few decorators, exposing Resources, Tools, and Prompts to LLMs in minutes while adding production features like auth, deployment helpers, proxying, OpenAPI/FastAPI import, and in-memory testing so data teams can plug existing APIs into agent workflows without boilerplate.
Event-driven Scheduling of Airflow DAGs with Apache Kafka (GitHub Repo)
Astro CLI template links Kafka to Airflow 3's event-driven scheduler: a producer DAG sends a message and an event-driven DAG fires instantly without cron or sensors.
Curated deep dives, tools and trends in big data, data science and data engineering π
Join 400,000 readers for
one daily email