TLDR Data 2025-06-05

Snowflake Summit Highlights ❄️, DuckLake in Detail 🦆, Docs-as-Code 📄

📱

Deep Dives

Adopting Docs-as-Code at Pinterest (11 minute read)

Pinterest's docs-as-code approach uses a standardized static site generator to centralize Markdown-based documentation from multiple repositories. The PDocs tool has grown to support 140+ documentation projects from 60+ GitHub repositories. It has become one of the most visited internal sites at Pinterest.

DuckLake w/ Hannes Mühleisen - Practical Data Lunch and Learn (58 minute podcast)

DuckLake shifts Iceberg-style table metadata into any transactional SQL database (DuckDB extension out of the box) while leaving data in Parquet on object storage, so you keep the open-format benefits but lose the JSON/Avro sprawl. The result: fast, multi-table ACID commits, snapshot time-travel, per-file encryption, and no more “small-file” pain; thousands of DuckDB or other compute nodes can read-write concurrently through a lightweight metastore. As the spec is open and format-compatible, you can zero-copy import existing Iceberg or Hive partitions and mix-and-match storage, catalog, and compute to suit cost or performance needs.

In a Snap! Turbo-charging Our Database Backups (8 minute read)

Wealthfront reengineered its database backup strategy by moving from nightly logical backups using mysqldump to near-instant, incremental AWS EBS volume snapshots triggered from cloud-based replicas. This resulted in backup times shrinking from hours to seconds and full database restore times dropping from 5-7 days to a few hours, eliminating failures and replica lag. The shift enabled automated validation, improved resilience, and empowered broader, faster database operations and sandboxing.

🚀

Opinions & Advice

Databricks SQL Scripting (5 minute read)

Databricks has introduced SQL Scripting and stored procedures using SQL/PSM syntax, allowing engineers to encapsulate procedural logic, manage variables, and create reusable pipelines directly in Unity Catalog. While this promises flexibility reminiscent of legacy SQL Server environments, it revives historic challenges: code invisibility, debugging complexity, and the risk of overloading stored procedures with business logic. The key takeaway: leverage this feature judiciously, emphasizing clean logic, source control, and pipeline observability to avoid the pitfalls of “stored procedure hell.”

6x Faster ML Inference: Why Online≫Batch (8 minute read)

Whatnot's transition from batch to online ML inference reduced latency by 5.8x, achieving 120ms p99 latency and >99.9% reliability by optimizing with Redis, native C binaries for GBDT models, gRPC, and thread pool tuning. This real-time system handles millions of daily predictions, eliminating 5% weekly coverage losses, which leads to significant business value.

Scaling Large Language Model Serving Infrastructure at Meta (49 minute video)

Meta's AI infrastructure supports massive internal workloads, such as processing hundreds of millions of examples for reinforcement learning from human feedback (RLHF) while also powering public-facing services like Meta AI and smart glasses. Scaling techniques include continuous batching, KV cache, tensor/pipeline parallelism, and quantization.

💻

Launches & Tools

Welcome to the Age of $10/month Lakehouses (18 minute read)

The $10/month Lakehouse leverages serverless technologies like Cloudflare R2 for storage, Cloudflare Containers for compute, and Neon Postgres with DuckLake for metadata management. This simple deployment has minimal cost compared to mainstream Lakehouses like Apache Iceberg or Delta Lake.

Snowflake Summit 2025 Highlights: Building the Future of AI and Apps (6 minute read)

Snowflake Summit 2025 rolled out an end-to-end AI stack inside its governed platform: agentic interfaces (Snowflake Intelligence and Data Science Agent), AISQL for multimodal queries, and built-in observability plus access to leading LLMs, all running securely within the Snowflake perimeter. Data engineers get Openflow (Apache NiFi-based ingestion), native dbt Projects in Snowsight Workspaces, deeper Apache Iceberg integration, and DevOps upgrades (Git, Terraform, and Python 3.9) to simplify pipeline creation and lakehouse interoperability. Performance and collaboration jump with Adaptive Compute auto-scaling warehouses, Semantic Views, Gen2 warehouse speed boosts, SnowConvert AI for legacy migrations, and a marketplace of agentic apps and knowledge extensions that keep governance and security intact.

Introducing Plotly Studio (10 minute read)

Plotly Studio is an AI-native desktop tool that turns raw datasets (plus optional goals) into fully functional Dash apps in about two minutes, generating ~2K lines of clean Python code and spec files that stay in sync for easy editing, review, and version control. Running entirely on the user's machine, it combines Plotly's visualization best practices with LLM “world knowledge,” giving data engineers a fast, private way to build production-ready, interactive analytics apps without setup or boilerplate.

🎁

Miscellaneous

Data Drift Is Not the Actual Problem: Your Monitoring Strategy Is (11 minute read)

Data drift alone is not the primary challenge in ML monitoring. Understanding its business impact and aligning monitoring strategies with actionable goals is crucial. Effective monitoring requires targeted metrics that capture meaningful deviations affecting model performance, rather than simply flagging statistical noise. Prioritizing business-relevant signals over raw drift detection ensures focus on genuine risks and improves model reliability.

The Belgian lab shaping modern soccer's data revolution (7 minute read)

KU Leuven's Sports Analytics Lab shows how open, university-led research can push soccer analytics forward. It releases advanced models (e.g., possession valuation and Bayesian win-probability) and open-source tools that clubs adopt without the heavy data-engineering lift. Academic labs, free from short-term performance pressure and infrastructure maintenance, can tackle harder problems (tracking data and deep-learning pipelines) than most in-house teams, underscoring the ROI of funding fundamental research.

The Power of TIG Stack: Mastering Time Series Data Management (5 minute read)

Combining Telegraf, InfluxDB 3 Core, and Grafana (the TIG stack) delivers an optimized, open-source framework for low-latency time series data management. InfluxDB 3 Core leverages Apache Arrow and DataFusion to achieve high-frequency ingestion, batched write-ahead logging, and advanced compression. Telegraf's plugin-driven data acquisition and Grafana's powerful visualization and alerting make TIG suitable for diverse telemetry and observability use cases.

⚡️

Quick Links

Kicking the Tires on CedarDB's SQL (5 minute read)

CedarDB is a new Postgres-compatible database from TUM research.

Apache Gravitino (GitHub Repo)

Apache Gravitino is a federated metadata lake that exposes a RESTful interface to manage and query metadata across heterogeneous data sources.

Curated deep dives, tools and trends in big data, data science and data engineering 📊

Join 400,000 readers for one daily email

Privacy Careers Advertise