TLDR Data 2025-07-07
Inside Reddit’s Architecture 🏛️, Lightweight Semantic Layer 🪶, Iceberg Spec Issues ⚠️
Boring Semantic Layer + MCP = 🔥 (5 minute read)
The Boring Semantic Layer is a lightweight interface that standardizes data access for LLMs, reducing SQL errors and hallucinations by exposing only pre-validated aggregations. It can be integrated with the Model Context Protocol (MCP) so that LLMs interact with semantic models through pre-defined API endpoints for listing, describing, and querying models. This setup streamlines natural-language access to data and allows greater control over semantic model and query design. A complete implementation based on Ibis and FastMCP is available.
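To make the "only pre-validated aggregations" idea concrete, here is a minimal sketch in plain Python. It does not use Ibis or FastMCP and is not the article's implementation; the `MODELS` registry, the `flights` data, and the function names `list_models`, `describe_model`, and `query_model` are all hypothetical, chosen only to mirror the listing, describing, and querying endpoints mentioned above.

```python
from collections import defaultdict

# Hypothetical in-memory rows standing in for a real table.
FLIGHTS = [
    {"origin": "JFK", "carrier": "AA", "distance": 100},
    {"origin": "JFK", "carrier": "DL", "distance": 300},
    {"origin": "SFO", "carrier": "AA", "distance": 200},
]

# Each model declares the only columns that may be grouped on (dimensions)
# and the only aggregations that may be computed (measures).
MODELS = {
    "flights": {
        "rows": FLIGHTS,
        "dimensions": {"origin", "carrier"},
        "measures": {
            "flight_count": lambda rows: len(rows),
            "total_distance": lambda rows: sum(r["distance"] for r in rows),
        },
    }
}


def list_models():
    """Return the names of all registered semantic models."""
    return sorted(MODELS)


def describe_model(name):
    """Return the dimensions and measures a caller is allowed to request."""
    model = MODELS[name]
    return {
        "dimensions": sorted(model["dimensions"]),
        "measures": sorted(model["measures"]),
    }


def query_model(name, dimensions, measures):
    """Group rows by the requested dimensions and compute the requested measures.

    Anything not declared in the model is rejected instead of being
    translated into free-form SQL.
    """
    model = MODELS[name]
    unknown = (set(dimensions) - model["dimensions"]) | (
        set(measures) - set(model["measures"])
    )
    if unknown:
        raise ValueError(f"not defined in model {name!r}: {sorted(unknown)}")

    groups = defaultdict(list)
    for row in model["rows"]:
        groups[tuple(row[d] for d in dimensions)].append(row)

    result = []
    for key, rows in sorted(groups.items()):
        record = dict(zip(dimensions, key))
        for measure in measures:
            record[measure] = model["measures"][measure](rows)
        result.append(record)
    return result
```

In this sketch, `query_model("flights", ["origin"], ["flight_count"])` groups by origin, while a request for an undeclared measure such as `avg_delay` raises `ValueError`. In a setup like the one described, each of the three functions would presumably be registered as an MCP tool.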
Driving Content Delivery Efficiency Through Classifying Cache Misses (11 minute read)
Netflix's Open Connect CDN enhances content delivery efficiency by classifying cache misses to identify why content isn't served from local Open Connect Appliances (OCAs), using a steering service to rank local sites by network proximity. Netflix uses models to forecast content popularity, optimize caching, and reduce costly recomputation.
Atlassian's 4 Million PostgreSQL Database Migration: When Standard Cloud Strategies Fail (12 minute read)
Atlassian migrated 4 million tenant-isolated Jira PostgreSQL databases to Amazon Aurora, overcoming managed service limitations by developing a custom orchestration tool. The multi-month migration utilized AWS Step Functions, feature flags, and a bespoke "draining" process to handle over 27.4 billion files and peak migration rates of 90,000 databases per day, converting 2,403 RDS instances. This effort significantly improved scalability, reliability (99.99% SLA), and cost efficiency.
9 Trends Shaping the Future of Data Management in 2025 (6 minute read)
Modern data management is shaped by nine key trends, with artificial intelligence revolutionizing every layer from data collection to analysis. Data mesh architectures decentralize ownership, empowering cross-functional teams, while AI-driven synthetic data supports compliance and efficiency.
Iceberg, The Right Idea - The Wrong Spec - Part 1 of 2: History (15 minute read)
Iceberg's table format solves vendor lock‑in on paper but inherits the worst parts of Hadoop: it leans on object storage that is slow for tiny, high‑churn metadata and offers weak guarantees for concurrency, atomicity, and fragmentation control. Without a proper storage engine layer, teams will spend more effort wrangling space management and performance than they ever saved on “open” files. Proceed cautiously until the spec addresses these gaps.
How AI is Changing Software Engineering at Shopify with Farhan Thawar (47 minute podcast)
Farhan Thawar, Head of Engineering at Shopify, shares how the company fully embraces AI, even encouraging its use during interviews. As the first organization outside of GitHub to adopt GitHub Copilot and an early internal user of Cursor, Shopify has gained valuable insights throughout its AI journey.
Event-Driven AI Agents: Why Flink Agents Are the Future of Enterprise AI (6 minute read)
Flink Agents, powered by Apache Flink and integrated with platforms like Confluent Cloud, are poised to transform enterprise AI by enabling real-time, event-driven, autonomous systems that overcome the limitations of traditional batch processing. They process continuous data streams with low latency, allowing AI agents to make context-aware decisions instantly.
A Guide to Converting ADK Agents with MCP to the A2A framework (5 minute read)
Google's Agent Development Kit (ADK) enables developers to transform standalone AI agents into collaborative components using the Agent-to-Agent (A2A) protocol. The A2A protocol allows agents to discover each other's capabilities via Agent Cards, interact securely, and delegate tasks to other A2A-compatible agents via an "Orchestrator Agent".
RAPIDS Adds GPU Polars Streaming, a Unified GNN API, and Zero-Code ML Speedups (2 minute read)
NVIDIA RAPIDS 25.06 introduces a Polars GPU streaming engine that enables execution on larger-than-VRAM datasets and adds support for rolling window functions. The release also introduces a unified API for graph neural networks (GNNs), extends its zero-code-change acceleration to support vector machines (SVMs) in scikit-learn workflows, and adds support for Python 3.13.
DuckLake 0.2 (5 minute read)
DuckLake v0.2 organizes data files into structured paths with schema and table subdirectories, addressing previous limitations. It also adds support for secrets to manage credentials, new functions such as ducklake_list_files for better system integration, and flexible settings for Parquet file compression.
Rethinking Data Science Interviews in the Age of AI (11 minute read)
AI-powered tools are rapidly changing data science interviews by automating technical screenings, generating code solutions, and assessing candidates' analytical thinking with greater objectivity. Hiring managers must now focus on evaluating business problem-solving, communication, and real-world data skills beyond automated code tests, while candidates should demonstrate adaptability and contextual understanding.
How Reddit Works 🔥 (15 minute read)
Reddit started with a single-machine Postgres server. As the user base grew, Reddit partitioned its database for scalable writes, introduced job queues to process new posts, votes, and comments asynchronously, and layered denormalized, precomputed lists and comment trees on cache servers and Cassandra for sub-second reads. Vote and comment queues were further sharded by subreddit and managed with ZooKeeper locks to eliminate contention. Today, this event-driven architecture delivers responsive, real-time feeds and nested comment views for over 100 million daily users.
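The queue sharding described above can be illustrated with a small single-process sketch. This is not Reddit's code: the shard count, the function names, and the `scores` dictionary are hypothetical, and a `threading.Lock` per shard stands in for the distributed ZooKeeper locks mentioned in the summary.

```python
import threading
import zlib
from collections import deque

NUM_SHARDS = 4  # assumed shard count, for illustration only

# One queue and one lock per shard. The in-process lock is a stand-in
# for a distributed lock service; it is not how ZooKeeper is used.
queues = [deque() for _ in range(NUM_SHARDS)]
locks = [threading.Lock() for _ in range(NUM_SHARDS)]

# Precomputed, denormalized scores: (subreddit, post_id) -> score.
scores = {}


def shard_for(subreddit):
    """Map a subreddit to a shard so all of its votes share one queue.

    crc32 is used because it is stable across processes, unlike
    Python's built-in hash() for strings.
    """
    return zlib.crc32(subreddit.encode("utf-8")) % NUM_SHARDS


def enqueue_vote(subreddit, post_id, delta):
    """Record a vote asynchronously by appending it to the shard's queue."""
    queues[shard_for(subreddit)].append((subreddit, post_id, delta))


def drain_shard(shard):
    """Apply all queued votes for one shard while holding that shard's lock."""
    with locks[shard]:
        while queues[shard]:
            subreddit, post_id, delta = queues[shard].popleft()
            key = (subreddit, post_id)
            scores[key] = scores.get(key, 0) + delta


# Usage: votes for different subreddits can be drained independently.
enqueue_vote("python", "p1", 1)
enqueue_vote("python", "p2", 1)
enqueue_vote("python", "p2", 1)
enqueue_vote("rust", "r1", -1)
for shard in range(NUM_SHARDS):
    drain_shard(shard)
```

Because every vote for a given subreddit lands in the same shard, a worker holding that shard's lock never competes with another worker over the same post, which is the contention the sharding is meant to remove.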
Curated deep dives, tools and trends in big data, data science and data engineering 📊