TLDR Data 2025-04-07

Transitioning to Data Mesh 🕸️, Llama 4 on Databricks 🦙, Streaming for AI Agents 🔄

📱

Deep Dives

Normalizing Repeated JSON Fields in FDA Drug Data Using DuckDB (7 minute read)

Normalizing nested JSON fields in data with DuckDB can drastically boost query performance and reduce memory usage. By extracting high-cardinality fields into lookup tables and using integer indexing, a previously sluggish dataset transforms into a lean, efficient structure ideal for rapid analysis.

Security Data Engineering: From Data Deluge to Agent-First Security (10 minute read)

Security data engineering is being reinvented by leveraging LLMs and agent-based architectures to tame the massive, unstructured volumes of logs, events, and telemetry data generated across modern enterprises. Advanced AI-driven ETL pipelines now extract, filter, and enrich security data in real time—dramatically reducing storage costs and accelerating threat detection—while open standards like OpenTelemetry and open table formats foster a more interoperable observability ecosystem.

The Hard Things About Sync (13 minute read)

Sync engines, crucial for real-time data synchronization between servers and clients, significantly enhance web applications' responsiveness and collaboration capabilities, as seen in platforms like Figma and Google Docs. At Figma, the sync engine LiveGraph exemplifies a read-path sync design, maintaining ACID properties by leveraging a write-ahead-log. Key challenges in developing sync engines include state management, efficient query matching, permissions, schema versioning, conflict resolution, and scalability. For professionals evaluating sync solutions, considerations on latency, data volume, query complexity, and offline support are vital in determining whether to build or adopt existing technologies.

Wayfair's Multi-year Data Mesh Journey (29 minute video)

Wayfair transitioned from a centralized data model to a decentralized Data Mesh architecture over several years, empowering domain teams with end-to-end data ownership through data contracts and a unified internal ontology. This shift enhanced data quality, discoverability, and business outcomes across the organization.

🚀

Opinions & Advice

Scaling Data Governance without falling into the Data Mesh trap (6 minute read)

Data Mesh often fails when autonomy is pursued without strong governance, leading to disorganization and a return to centralization. Success lies in starting with mature domains, defining clear ownership, and setting guardrails to balance freedom and control. A central governance team supports this by enabling a data marketplace where teams can share high-quality, compliant data assets.

The AI Silo Problem: How Data Streaming Can Unify Enterprise AI Agents (7 minute read)

Data streaming can bridge the gap between isolated AI agents, unifying them into a coordinated ecosystem that enhances real-time decision-making. Enterprises can break down silos by leveraging platforms like Apache Kafka and Flink, ensuring that insights flow seamlessly across systems—from CRM to data warehouses. This event-driven approach not only streamlines operations but also drives efficiency and innovation across the enterprise.

How to Refactor Your Tracking Design (7 minute read)

Legacy tracking implementations often become a tangled mess, resulting in inconsistent data and unreliable insights. A structured refactoring approach centered on clear event schemas, consistent naming conventions, and centralized monitoring ensures tracking systems are maintainable, scalable, and aligned with evolving business needs, ultimately empowering teams with more accurate analytics and decision-making capabilities.

💻

Launches & Tools

MarkItDown (GitHub Repo)

MarkItDown is an open source lightweight Python utility from Microsoft that converts diverse document formats into Markdown while preserving key structural elements for efficient text analysis by LLMs. It supports multiple file types and plugins, making it a versatile tool for integrating document conversion into data pipelines.

Introducing Meta's Llama 4 on the Databricks Data Intelligence Platform (4 minute read)

Meta's Llama 4 is now part of the Databricks Data Intelligence Platform, enabling enterprises to build domain-specific AI agents and copilots that work directly with their data. The Llama 4 Maverick model offers faster inference and supports longer context windows along with multilingual capabilities, making it ideal for complex tasks like document understanding and workflow automation. It comes with built-in governance features such as logging, rate limiting, and PII detection to ensure secure and compliant AI deployment. This integration provides data professionals with a scalable, efficient solution to run advanced AI models seamlessly within their existing pipelines.

🎁

Miscellaneous

What is Data Architecture and why Data Engineers should consider it (7 minute read)

Data architecture maps how information flows through an organization, aligning storage and processing with business objectives. Effective systems scale for traffic spikes while flexibly integrating new tools without disruption. Reliable pipelines with retry mechanisms and backups ensure continuity during failures.

How to evaluate an LLM system (8 minute read)

Evaluating LLM systems is complex due to their probabilistic and variable outputs, making traditional testing methods insufficient. LLM systems require benchmarks and evals. Benchmarks use public datasets and metrics to compare model performance objectively. Evals analyze model behavior in context, including sensitivity to input changes, hallucination detection, and output quality.

⚡️

Quick Links

Iceberg is not a database, it's a spec (1 minute read)

Iceberg is a specification for managing data, not a database, so features like compaction rely on implementations (e.g., Spark or Trino) that you must actively run.

The Hidden Cost of Starting Databricks Clusters: How VMs Traffic Can Drain Your Budget (6 minute read)

Databricks clusters can incur unexpected costs as each node downloads over 12 GB of data during startup, leading to significant firewall traffic and billing spikes.

Curated deep dives, tools and trends in big data, data science and data engineering 📊

Join 400,000 readers for one daily email

Privacy Careers Advertise