TLDR Data 2025-05-19
Better SQL with AI 🤖, Multimodal data querying future 🔮, Flink CDC Updates 💾
Getting AI to Write Good SQL: Text-to-SQL Techniques Explained (8 minute read)
High-quality text-to-SQL requires more than raw LLM power—it needs domain context, intent disambiguation, and dialect precision. Systems provide LLMs with the necessary business context by retrieving relevant schemas, sampling real data, and adding semantic annotations. Ambiguous queries are clarified through follow-up questions, while SQL-aware models and validation loops ensure accuracy. Continuous evaluation using benchmarks and user feedback drives improvements in accuracy and reliability.
Turning Data Into Insight: Flexible Lakehouse with MinIO, Iceberg, Airflow, dbt, Spark, Pandera, & Superset (17 minute read)
This article outlines the steps to build a modern data lakehouse using Apache Iceberg, Apache Spark, MinIO, Airflow, dbt, and Apache Superset. The guide covers the setup of the necessary infrastructure using Docker, integrating Spark with Iceberg for efficient data storage, and using dbt for data transformation and insight generation via Superset. The focus is on leveraging open-source tools to create a scalable, high-performance, and cost-effective data processing pipeline.
DuckDB + PyIceberg + Lambda (8 minute read)
You can prototype a serverless lakehouse by combining DuckDB's in-process analytics, PyIceberg's table management, and AWS Lambda's event-driven compute. Incoming S3 files trigger Lambda functions to register data in Iceberg, while DuckDB handles downstream queries using Iceberg's metadata for partition pruning and schema evolution. This architecture enables sub-second analytics, ACID guarantees, and zero-server maintenance, making it ideal for cost-effective, scalable data platforms.
Handling GTFS Data with DuckDB (8 minute read)
A hands-on guide to using DuckDB to manage and query GTFS (General Transit Feed Specification) data, which is a standardized format for public transportation schedules. The process includes creating a DuckDB database, importing GTFS data from various sources, querying it efficiently, and exporting it to storage-optimized formats like Parquet. The use of DuckDB allows for easy integration, error handling, and compression of large datasets, making it a valuable tool for data-driven organizations.
"Streaming vs. Batch" Is a Wrong Dichotomy, and I Think It's Confusing (3 minute read)
Batch and stream processing aren't mutually exclusive. Modern streaming systems often use batching techniques like SIMD processing and Apache Arrow for optimized throughput. The key architectural difference lies in "push" (real-time streaming) versus "pull" (interval-based batch querying). Streaming offers immediate access to changing data, while batch may risk staleness. Storage/compute separation improves streaming efficiency, but managing state and out-of-order data remains complex. Combining both approaches delivers agility, minimal latency, and real-time insights without losing efficiency.
Building AI Agents? A2A vs. MCP Explained Simply (4 minute read)
Agent-to-Agent (A2A) and Model Context Protocol (MCP) are complementary AI frameworks. A2A, introduced by Google in 2025, standardizes communication between agents using JSON-RPC 2.0 and Agent Cards, ideal for multi-agent collaboration. MCP, launched by Anthropic in 2024, connects agents to external tools and data via a client-server model, simplifying integrations for tasks like querying APIs. A2A focuses on agent collaboration, while MCP enhances individual agent capabilities, together forming a scalable ecosystem for autonomous AI workflows.
We Need a New…Database? (12 minute read)
Cloud databases like Databricks and Snowflake excel in data centralization and scalability, but many enterprise teams still struggle to provide real business value. AI-driven tools like Notion's Enterprise Search and Granola 2.0 challenge traditional BI by enabling direct querying of unstructured data, offering more nuanced insights. This shift points to a future where databases natively support multimodal, semantically linked data, unlocking new possibilities for data engineers and rethinking what data is truly valuable.
Apache Flink CDC 3.4.0 Release Announcement (3 minute read)
Flink CDC 3.4.0 introduces support for Flink 1.19.x and 1.20.x, a new Apache Iceberg pipeline connector, and batch execution mode for full data sync without incremental updates. Key enhancements include improved MySQL, Paimon, and MongoDB connectors, optimized schema evolution processing, and transform framework upgrades. Application mode YARN support and multiple bug fixes further streamline data pipeline reliability.
Doctor (GitHub Repo)
Doctor is a system that allows LLM agents to discover, crawl, and index web pages to improve reasoning and code generation. It uses a stack that includes DuckDB for storing embeddings, Redis for message brokering, and FastAPI for web services. The system can be integrated with editors via an MCP server, and its components include web crawling, chunking, and search functionality.
So You Think You Want to Quit Your Job? (7 minute read)
Consulting in data engineering offers significant career leverage by exposing professionals to diverse challenges, industries, and leadership opportunities—far beyond what most in-house roles provide. Building a robust network and demonstrating business impact is crucial for landing high-value projects, often ranging from $10K to $100K+. Success depends on patience, continuous relationship-building, and emphasizing outcomes over technical details.
Some English Hospitals Doubt Palantir's Utility: We'd “Lose Functionality Rather than Gain it” (3 minute read)
Approximately one-third of NHS England trusts have gone live with Palantir's £330 million Federated Data Platform, but hospitals are reporting concerns over its functionality and practical utility. Key issues include difficulty integrating with existing systems and insufficient support for critical workflows, raising questions about operational readiness and value delivery. Data platform stakeholders should closely monitor ongoing feedback and demand robust interoperability and support features to maximize the investment's impact.
AI Agents Unite: Conference Reveals Next-Gen Frameworks (7 minute read)
Key advancements in multi-agent frameworks and data infrastructure were showcased at the AI Agent Conference in NYC. Tools like AG2 and Arklex were unveiled for scalable agent orchestration, while innovations such as Boundary's BAML and Timescale's text-to-SQL focus on LLM integration, reliability, and reducing AI hallucinations. LanceDB introduced unified storage for multimodal workloads, improving vector management and training pipelines.
Curated deep dives, tools and trends in big data, data science and data engineering 📊
Join 400,000 readers for
one daily email