Better SQL with AI 🤖, Multimodal data querying future 🔮, Flink CDC Updates 💾

"Streaming vs. Batch" Is a Wrong Dichotomy, and I Think It's Confusing (3 minute read)

Batch and stream processing aren't mutually exclusive. Modern streaming systems often process data in batches internally, using techniques such as SIMD and columnar formats like Apache Arrow to raise throughput. The key architectural difference is "push" (results update as data arrives) versus "pull" (queries run on demand or at intervals). Streaming gives immediate access to changing data, while batch results can go stale between runs. Separating storage from compute improves streaming efficiency, but managing state and out-of-order data remains complex. Combining both approaches delivers agility, low latency, and real-time insights without sacrificing efficiency.
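To make the push versus pull distinction concrete, here is a minimal Python sketch. It is not taken from the linked article: `Source` and `RunningSum` are hypothetical names invented for illustration, and an in-memory list stands in for real storage.

```python
from typing import Callable, List


class Source:
    """Hypothetical in-memory data source that supports both access styles."""

    def __init__(self) -> None:
        self.rows: List[int] = []
        self.subscribers: List[Callable[[List[int]], None]] = []

    def subscribe(self, callback: Callable[[List[int]], None]) -> None:
        # Push: the consumer registers once and is notified on every write.
        self.subscribers.append(callback)

    def write(self, batch: List[int]) -> None:
        # Writes arrive as small batches, so the push path still batches internally.
        self.rows.extend(batch)
        for callback in self.subscribers:
            callback(batch)

    def query_sum(self) -> int:
        # Pull: the consumer recomputes over stored data whenever it asks.
        return sum(self.rows)


class RunningSum:
    """Push-side consumer that keeps incrementally updated state."""

    def __init__(self) -> None:
        self.total = 0

    def on_batch(self, batch: List[int]) -> None:
        self.total += sum(batch)


source = Source()
pushed = RunningSum()
source.subscribe(pushed.on_batch)

pulled = source.query_sum()  # pull-side snapshot taken before new data arrives
source.write([1, 2, 3])
source.write([4, 5])

# The push consumer is already current; the pull snapshot is stale until re-queried.
print(pushed.total, pulled, source.query_sum())
```

The pushed total follows each write without being asked, while the pulled value only catches up on the next query. That gap is the staleness the summary refers to.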

TLDR Data 2025-05-19

Deep Dives

Getting AI to Write Good SQL: Text-to-SQL Techniques Explained (8 minute read)

Turning Data Into Insight: Flexible Lakehouse with MinIO, Iceberg, Airflow, dbt, Spark, Pandera, & Superset (17 minute read)

DuckDB + PyIceberg + Lambda (8 minute read)

Handling GTFS Data with DuckDB (8 minute read)

Opinions & Advice

"Streaming vs. Batch" Is a Wrong Dichotomy, and I Think It's Confusing (3 minute read)

Building AI Agents? A2A vs. MCP Explained Simply (4 minute read)

We Need a New…Database? (12 minute read)

Launches & Tools

Apache Flink CDC 3.4.0 Release Announcement (3 minute read)

Doctor (GitHub Repo)

Miscellaneous

So You Think You Want to Quit Your Job? (7 minute read)

Some English Hospitals Doubt Palantir's Utility: We'd “Lose Functionality Rather than Gain it” (3 minute read)

AI Agents Unite: Conference Reveals Next-Gen Frameworks (7 minute read)

Quick Links

Machine Learning Prototyping with DuckDB and scikit-learn (5 minute read)

Balancing Off-the-Shelf and Custom Solutions in Data Engineering (46 minute podcast)

Curated deep dives, tools and trends in big data, data science and data engineering 📊