TLDR Data 2025-06-30
Self-correcting AI SQL 📝, Blazing-fast JOINs 🏎️, SQLRooms Local-first Analytics 🐤
Join Me if You Can: ClickHouse vs. Databricks & Snowflake - Part 2 (11 minute read)
In a benchmark comparing JOIN-heavy SQL queries, ClickHouse Cloud outperformed Databricks and Snowflake by achieving up to 6.6× faster query times and over 60% cost savings through the use of in-memory dictionaries instead of traditional joins with minimal tuning.
MUVERA: Making multi-vector retrieval as fast as single-vector search (5 minute read)
Google's MUVERA algorithm transforms complex multi-vector retrieval into single-vector maximum inner product search, enabling faster and more efficient information retrieval. By converting multi-vector embeddings into fixed dimensional encodings (FDEs), MUVERA achieves comparable recall to state-of-the-art methods while reducing latency by 90% and retrieving 2-5× fewer candidates across diverse BEIR datasets.
Discovering DuckDB Use Cases via GitHub (8 minute read)
Utilize the GitHub API to find repositories referencing DuckDB, then harness DuckDB's SQL features to effectively parse and analyze the data. This method streamlines insights into DuckDB's adoption across GitHub projects.
I Made Cursor + AI Write Perfect SQL. Here's the Exact Setup (17 minute read)
AI-generated SQL is often brittle, but pairing Cursor with DuckDB and MotherDuck enables self-correcting workflows where the AI runs, debugs, and iterates on SQL locally using real schema context. This closed-loop setup isolates workloads, avoids production risk, and dramatically reduces manual debugging for data teams.
Lakebase: Databricks' Bold Play to Fuse OLTP and the Lakehouse (3 minute read)
Databricks' Lakebase, a managed PostgreSQL database, is pricey but could save costs by simplifying data pipelines and reducing complexity for real-time analytics. It's a potential game-changer for Databricks users by unifying OLTP and OLAP, but others should evaluate its performance and lock-in risks. Real-world benchmarks will determine if it lives up to the hype.
You've Got 99 Problems but Data Shouldn't Be One (29 minute podcast)
Tobiko Data is creating a new standard in data transformation with its products, SQLGlot and SQLMesh. SQLGlot, a Python-based parser and transpiler, converts SQL into abstract syntax trees (ASTs) to analyze metadata or translate SQL across different dialects. SQLMesh is a data transformation and orchestration framework that uses SQLGlot's parsing for semantic understanding, enabling unit testing and supporting warehouse migrations.
Stop Building AI Agents (9 minute read)
Building fully autonomous AI agents is often unnecessary and leads to brittle, complex systems that are hard to debug. Most workflows benefit from simpler patterns like chaining, routing, or orchestrator-worker structures. Direct API calls and clear, sequential logic outperform agents in typical enterprise settings while agents only excel when paired with continuous human oversight for dynamic, creative tasks.
What's Driving Fluent Bit Adoption? (7 minute read)
Fluent Bit's rapid global adoption, with over 15 billion downloads and integration with all major cloud platforms, stems from its exceptional scalability, resource efficiency, and seamless compatibility with Fluentd in hybrid environments. Its native support for OpenTelemetry Protocol, robust multithreading, and plugin extensibility enable unified, low-overhead collection of logs, metrics, and traces across diverse architectures.
videoparquet (GitHub Repo)
videoparquet lets you convert Parquet files into video formats and back using ffmpeg, enabling efficient compression and storage of tabular or array-like data. It supports both lossless and lossy codecs, retains schema metadata, and includes tooling for roundtrip validation and reproducibility.
Foursquare Introduces SQLRooms (8 minute read)
SQLRooms is an open-source framework for building analytics apps that run entirely in the browser or on a laptop with no backend infrastructure. It combines DuckDB for in-process analytics, local AI via Ollama, and a React-based UI toolkit to create fast, private, and fully offline-capable data apps using modern browser technologies.
Pipelining AI/ML Training Workloads with CUDA Streams (7 minute read)
Leveraging CUDA streams for pipelining AI/ML training workloads in PyTorch can significantly accelerate compute throughput by enabling concurrent data transfers and computation on GPUs. This technique helps minimize idle times by overlapping forward, backward, and optimizer steps, resulting in up to 20-30% faster training times depending on model size and data pipeline complexity.
Grafana's CTO on the State of the Observability Market (8 minute read)
Observability is rapidly consolidating, with large enterprises reducing tool sprawl (often moving from many providers to just two or three) driven by OpenTelemetry's standardization and a sharp focus on cost control and vendor lock-in avoidance. Open source solutions like Prometheus and Loki now commoditize core telemetry functions, while new entrants compete on technological innovation and broader business use cases. The market is shifting toward unified, user-friendly observability systems to foster faster insights and democratized access for technical and business stakeholders.
Curated deep dives, tools and trends in big data, data science and data engineering 📊
Join 400,000 readers for
one daily email