TLDR Data 2026-07-02
Google’s Tabular Foundation Model 🧾, Meta’s Data Eng Agent 🛠️, LLM Spark Debugger 🚦
Using LLMs to Analyze Spark SQL Plans: A Practical Approach to Debugging Long-Running Jobs (8 minute read)
Instead of manually parsing complex physical plans and DAGs to debug long-running Spark SQL jobs, Expedia feeds the plans (along with relevant context) to LLMs, which analyze and explain them to quickly identify bottlenecks, inefficient joins, skewed data, or suboptimal operators. This significantly speeds up troubleshooting for production Spark workloads.
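As a rough illustration of "feeding the plan along with relevant context", the sketch below assembles a prompt from a plan string and a few job facts. It is not Expedia's implementation: the function names, the watch list of operators, and the prompt wording are all assumptions, and the call to an actual LLM is deliberately left out.

```python
from typing import Dict, List

# Spark physical-plan operator names that often merit a closer look.
# This watch list is an illustrative assumption, not Expedia's.
WATCHED_OPERATORS = (
    "CartesianProduct",
    "BroadcastNestedLoopJoin",
    "SortMergeJoin",
    "Exchange",
)


def find_watched_operators(plan_text: str) -> List[str]:
    """Return the watched operator names found in the plan, in watch-list order."""
    return [op for op in WATCHED_OPERATORS if op in plan_text]


def build_plan_prompt(plan_text: str, context: Dict[str, str]) -> str:
    """Assemble a plain-text prompt from a physical plan and job context."""
    context_lines = "\n".join(
        f"- {key}: {value}" for key, value in sorted(context.items())
    )
    flagged = ", ".join(find_watched_operators(plan_text)) or "none"
    return (
        "You are helping debug a long-running Spark SQL job.\n"
        f"Job context:\n{context_lines}\n"
        f"Operators flagged by a simple text scan: {flagged}\n"
        "Physical plan:\n"
        f"{plan_text}\n"
        "Explain the likely bottlenecks (joins, skew, shuffles) and suggest fixes."
    )


if __name__ == "__main__":
    # Hypothetical plan fragment, shortened for the example.
    sample_plan = (
        "== Physical Plan ==\n"
        "*(5) SortMergeJoin [user_id#1], [user_id#7], Inner\n"
        ":- Exchange hashpartitioning(user_id#1, 200)\n"
        "+- Exchange hashpartitioning(user_id#7, 200)"
    )
    print(build_plan_prompt(sample_plan, {"runtime": "3h", "input_size": "2 TB"}))
```

The plan text could come from `EXPLAIN FORMATTED` output or the Spark UI; the resulting string would then be sent to whichever model the team uses.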
Ontology Everywhere! (8 minute read)
Ontologies are re-emerging as a practical data platform layer because AI agents need explicit business meaning, not just schemas or dashboards. Unlike data models, ontologies encode shared concepts, typed relationships, constraints, and limited inference. In enterprise tools, this often appears as typed-edge traversal, semantic layers, or knowledge graphs. High-value deployments still require human-curated semantics, especially where systems can write back and act on decisions.
How We Built DEmate: Taming LLMs for Data Engineering at Meta (7 minute read)
Meta's DEmate is an internal LLM-powered assistant for data engineers that helps with writing SQL, generating pipelines, reviewing code, and understanding complex data flows at massive scale. The architecture combines RAG over internal data catalogs, schema documentation, and code repositories with carefully engineered prompts, multi-step reasoning chains, and human-in-the-loop feedback for evaluation and continuous improvement.
Building Indexes on a Moving Target (20 minute read)
Apache Hudi explores the challenges of building and maintaining indexes on continuously updating datasets (a "moving target"). It covers indexing strategies ranging from simple bloom filters to more advanced approaches, and weighs the trade-offs between index freshness, query performance, write overhead, and scalability in large-scale data lakes with high-velocity updates.
Never seen a data quality issue that wasn't actually an ownership problem (4 minute read)
Data quality failures are usually ownership failures: when multiple teams consume the same metric but no single person controls its definition, calculation, and change process, trust erodes and fixes stay temporary. The practical remedy is explicit metric governance: one named owner, clear decision rights, version/change control, and enforceable quality rules tied to the metric.
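One way to picture "one named owner, version/change control, and enforceable quality rules tied to the metric" is a small record type like the one below. This is a minimal sketch, not something from the linked post: `MetricDefinition`, its fields, and the example metric are hypothetical.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class MetricDefinition:
    """Hypothetical record tying a metric to an owner, a version, and rules."""
    name: str
    owner: str      # the single named owner with decision rights
    version: int    # bumped on every definition change
    sql: str        # the canonical calculation
    quality_rules: Dict[str, Callable[[float], bool]] = field(default_factory=dict)

    def change_definition(self, new_sql: str, approved_by: str) -> "MetricDefinition":
        """Return a new version of the metric; only the owner may approve it."""
        if approved_by != self.owner:
            raise PermissionError(f"only {self.owner} may change {self.name}")
        return replace(self, sql=new_sql, version=self.version + 1)

    def failed_rules(self, value: float) -> List[str]:
        """Return the names of quality rules that the given value violates."""
        return [name for name, check in self.quality_rules.items() if not check(value)]


if __name__ == "__main__":
    revenue = MetricDefinition(
        name="net_revenue",
        owner="alice",
        version=1,
        sql="SELECT SUM(amount) FROM orders",
        quality_rules={"non_negative": lambda v: v >= 0},
    )
    print(revenue.failed_rules(-5.0))
    updated = revenue.change_definition(
        "SELECT SUM(amount - refunds) FROM orders", approved_by="alice"
    )
    print(updated.version)
```

Making the record immutable means every change produces a new version rather than silently editing the old one, which is the change-control property the summary describes.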
Query Faster, Query Smarter: Our Move to DuckDB and What We Learned (4 minute read)
Arcesium migrated thousands of SQL queries from Athena to Trino to DuckDB over 18 months, cutting query costs by ~50% and reducing query runtime by ~50% for small-to-medium workloads. Athena hit account/service limits, while Trino solved scalability but increased resource cost. DuckDB delivered the needed speed with ~40% lower memory footprint. The migration required handling Glue-less schema evolution, Parquet compaction, JSON fallbacks for STRUCT mismatches, and thread parallelism tuning.
Too many tables are bad for you (6 minute read)
Having too many tables in PostgreSQL can seriously hurt performance. The hidden costs range from bloated catalogs and slower query planning to increased I/O. Practical guidance includes consolidating small, related tables, avoiding excessive schema-per-tenant patterns, monitoring catalog size and planning time, and using inheritance or declarative partitioning when appropriate.
Introducing TabFM: A zero-shot foundation model for tabular data (4 minute read)
TabFM is a foundation model for tabular classification and regression that reframes prediction as in-context learning, removing per-dataset training, hyperparameter tuning, and feature engineering. It was trained on hundreds of millions of synthetic datasets generated with structural causal models and benchmarked on TabArena across 38 classification and 13 regression datasets (700 to 150,000 rows), where it outperformed heavily tuned tree-based baselines.
SedonaDB 0.4: GPU-Accelerated Spatial Joins (3 minute read)
SedonaDB 0.4 adds RayBooster, a GPU spatial join engine that uses NVIDIA ray tracing cores to accelerate geometry intersection queries. It delivers up to ~5.9x faster joins, lower AWS costs, and in some cases lets a consumer RTX 3090 beat an H100 on spatial workloads.
TiDB (GitHub Repo)
TiDB is an open-source, cloud-native distributed SQL database with strong consistency, MySQL compatibility, horizontal scaling, high availability, and hybrid transactional/analytical processing. It separates compute and storage, supports TiKV and TiFlash, and is positioned for workloads needing transactions, analytics, vector search, and scalable infrastructure.
How To Corrupt An SQLite Database File (14 minute read)
SQLite is highly resistant to corruption, but can still be damaged by unsafe file access, bad backups, missing journals, broken locking, failed syncs, faulty storage, memory bugs, or risky PRAGMA settings.
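Two of the listed risks, bad backups and undetected damage, have standard countermeasures that can be sketched with Python's built-in `sqlite3` module. The helper names below are hypothetical; `PRAGMA integrity_check` and `Connection.backup` (Python 3.7+) are the real facilities being shown.

```python
import os
import sqlite3
import tempfile
from typing import List


def integrity_report(conn: sqlite3.Connection) -> List[str]:
    """Run PRAGMA integrity_check and return its result rows as strings."""
    return [row[0] for row in conn.execute("PRAGMA integrity_check")]


def safe_backup(source: sqlite3.Connection, target_path: str) -> None:
    """Copy a database with SQLite's online backup API rather than copying the file."""
    target = sqlite3.connect(target_path)
    try:
        source.backup(target)
    finally:
        target.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        live = sqlite3.connect(os.path.join(tmp, "live.db"))
        live.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, note TEXT)")
        live.execute("INSERT INTO events (note) VALUES ('hello')")
        live.commit()

        backup_path = os.path.join(tmp, "backup.db")
        safe_backup(live, backup_path)
        live.close()

        restored = sqlite3.connect(backup_path)
        print(integrity_report(restored))  # a healthy database reports ['ok']
        restored.close()
```

Copying the database file directly while another connection is writing to it is the kind of unsafe file access the article warns about; the backup API avoids that by going through SQLite itself.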
Data Residency Is Not a Legal Problem. It Is An Infrastructure Design Problem (5 minute read)
Data residency is an infrastructure problem, not a storage-only policy issue: regulated workloads must account for where data is stored, processed, logged, backed up, accessed, and where ML experiments run. Region parity gaps in managed services can force cross-region workarounds or delayed migrations, so teams need region-aware platforms with reproducible CI/CD, RBAC, audit logs, local backups, and portable compute.