New LLM Data Stack 🥞, Why Spark Feels Slow 🐌, MCP in Minutes ⌚

Why Apache Spark is Often Considered as Slow? (21 minute read)

Vanilla open-source Spark is a versatile, hybrid distributed query engine that supports diverse workloads, including OLAP on data warehouses and semi-structured data processing. That unified design, however, makes it slower than specialized vectorized engines such as Trino or Snowflake for OLAP queries on columnar data. Through the Catalyst Extensions API, solutions like Databricks Photon, Apache Gluten, and Apache DataFusion Comet turn Spark into a pure vectorized engine by replacing its row-based code generation with vectorized execution, building on native engines such as Velox, ClickHouse, or DataFusion for better performance.
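To make the row-based versus vectorized distinction concrete, here is a minimal pure-Python sketch. It is not how Spark, Gluten, or Comet are implemented; the table, column names, and batch size are all hypothetical, and real engines run these loops in native code over SIMD-friendly columnar buffers. It only shows the difference in shape: one call per row versus one operator call per column batch.

```python
from typing import Dict, Iterator, List

# Hypothetical toy table: 10,000 "orders" with a price and a quantity.
ROWS: List[Dict[str, float]] = [
    {"price": float(i % 50), "qty": i % 7} for i in range(10_000)
]


def row_at_a_time(rows: List[Dict[str, float]]) -> float:
    """Row-oriented execution: filter and aggregate touch one row per step."""
    total = 0.0
    for row in rows:
        if row["qty"] > 3:                      # filter, evaluated per row
            total += row["price"] * row["qty"]  # projection + sum, per row
    return total


def to_batches(
    rows: List[Dict[str, float]], batch_size: int = 1024
) -> Iterator[Dict[str, list]]:
    """Pivot rows into column batches: column name -> list of values."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {
            "price": [r["price"] for r in chunk],
            "qty": [r["qty"] for r in chunk],
        }


def vectorized(batches: Iterator[Dict[str, list]]) -> float:
    """Column-oriented execution: each operator runs over a whole batch."""
    total = 0.0
    for batch in batches:
        price, qty = batch["price"], batch["qty"]
        mask = [q > 3 for q in qty]             # filter over the qty column
        revenue = [p * q for p, q, keep in zip(price, qty, mask) if keep]
        total += sum(revenue)                   # aggregate per batch
    return total


if __name__ == "__main__":
    # Both strategies compute the same answer; only the execution shape differs.
    print(row_at_a_time(ROWS) == vectorized(to_batches(ROWS)))
```

In a native engine, the per-batch loops above would typically become tight loops over contiguous arrays, which is where the cache and SIMD gains of vectorized execution come from.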

TLDR Data 2025-06-26

Deep Dives

Why Apache Spark is Often Considered as Slow? (21 minute read)

Hands-on with Apache Iceberg (71 minute video)

Scaling Pinterest ML Infrastructure with Ray: From Training to End-to-End ML Pipelines (8 minute read)

How Skroutz Handles Real-time Schema Evolution in Amazon Redshift with Debezium (7 minute read)

Opinions & Advice

The Hidden Cost of Over-instrumentation: Why More Tracking Can Hurt Product Teams (4 minute read)

Data Integrity vs Data Security: Why You Need Both (3 minute read)

Launches & Tools

Find out what all the ducking fuss is about with the free DuckDB ebook (Sponsor)

Schema In, Data Out: A Smarter Way to Mock (4 minute read)

Introducing Northguard and Xinfra: Scalable Log Storage at LinkedIn (12 minute read)

Langfuse and ClickHouse: A New Data Stack for Modern LLM Applications (8 minute read)

New With Confluent Platform 8.0: Stream Securely, Monitor Easily, and Scale Endlessly (9 minute read)

Miscellaneous

Data federation: Understanding What It Is and How It Works (8 minute read)

Google Donates the Agent2Agent Protocol to the Linux Foundation (3 minute read)

Plane Tracking with Apache Flink (GitHub Repo)

Quick Links

FastMCP 2.0 (GitHub Repo)

Event-driven Scheduling of Airflow DAGs with Apache Kafka (GitHub Repo)

Curated deep dives, tools and trends in big data, data science and data engineering 📊