TLDR Data 2025-09-25
Continuous Testing in Airflow ✅, Netflix’s Trillion-Row Scale 🧮, Modular AI-First Stacks 🧠
Past Years in Data Engineering and Current Trends (2025 Edition - Part 2) (17 minute read)
Modern data stacks are shifting rapidly toward modular SaaS components and AI-powered capabilities, with stack templates accelerating deployment and embedding governance, while ARM and GPU architectures deliver compelling cost and throughput advantages (e.g., AWS Graviton3: 40% improved price/performance, Aerospike: 27% annual cost reduction). Unified query routing and open table formats enable multi-engine collaboration and mitigate vendor lock-in. Storage innovations target AI/ML workloads, while vector search and in-database AI functions are drastically reducing latency (sub-50ms inference), costs (e.g., 35% operational savings), and barriers to advanced analytics, transforming warehouses into active intelligence hubs.
Streaming and the RAD Stack (8 minute read)
Implementing streaming data products with the RAD (Rust, Arrow, and DataFusion) stack enables substantial throughput improvements—often 2x to 5x—thanks to efficient columnar, vectorized execution, as observed in extensive real-world benchmarks. While DataFusion's architecture offers deep extensibility, adapting it for robust streaming use cases requires targeted modifications, especially around checkpointing, sink connectors, and operator emission semantics. Iron Vector, a Rust-based accelerator for Apache Flink SQL/Table API, exemplifies this approach, delivering up to 2x performance gains without code changes.
Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale (8 minute read)
Muse began as a simple batch pipeline on Druid, but growing demands for advanced filtering, audience-based grouping, and analytics caused a combinatorial data explosion that strained query performance. Netflix re-architected it to handle trillion-row scale data by using HyperLogLog sketches for approximate distinct counts, precomputed in-memory aggregates, and optimized Druid storage/queries, which reduced P99 query latency by 50%.
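For readers unfamiliar with HyperLogLog, the following is a minimal, generic sketch of the technique in plain Python. It is not Netflix's or Druid's implementation. The class name, the precision `p=12`, and the use of SHA-1 as the hash are assumptions chosen for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical class, not a library API)."""

    def __init__(self, p=12):
        self.p = p                       # number of index bits
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Take 64 bits of a stable hash so results are reproducible.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Sketches combine by taking the register-wise maximum.
        if self.p != other.p:
            raise ValueError("precision mismatch")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)
```

Because sketches merge by register-wise maximum, distinct counts for overlapping audience groups can be combined from precomputed sketches without rescanning raw rows, at the cost of a small relative error (roughly 1.6% standard error with 4,096 registers).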
What the Fuss with Fluss: Flink Delta Force (9 minute read)
Flink 2.1 introduces DeltaJoin and MultiJoin, radically reducing join state bloat by offloading history to external stores like Apache Fluss, transforming streaming joins from terabyte-scale state management to on-demand lookups with minimal checkpoint overhead. DeltaJoin enables elastically scalable enrichment use cases, with cache-managed lookups shaving recovery times from minutes to seconds, but delivers only eventual consistency. For teams managing high-volume, bounded-dimension joins in Flink, this architecture cuts operational pain and state management complexity, though workloads needing snapshot consistency or high-cardinality, high-churn dimensions require alternative engines like RisingWave or Feldera.
Orchestrating Data Quality with Airflow (7 minute read)
Maintaining high data quality is challenging due to unclear ownership, bugs, messy source data, and constant changes. Integrating continuous testing into Airflow lets teams verify data integrity on every run, alongside regular pipeline operations. By utilizing reusable task groups, data engineers can embed quality checks directly into their workflows, promoting a proactive approach to maintaining data trust and reliability.
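The idea of a reusable bundle of quality checks can be sketched in plain Python, independent of Airflow. This is a rough illustration only: `Check`, `not_null`, `unique`, `quality_checks`, and `run_checks` are hypothetical names, not Airflow APIs, and the article's actual task groups may look quite different.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class Check:
    """A named predicate over a batch of rows (hypothetical helper type)."""
    name: str
    predicate: Callable[[List[Row]], bool]


def not_null(column: str) -> Check:
    # Passes when no row has a missing or None value in the column.
    return Check(
        f"not_null:{column}",
        lambda rows: all(r.get(column) is not None for r in rows),
    )


def unique(column: str) -> Check:
    # Passes when every value in the column appears exactly once.
    def predicate(rows: List[Row]) -> bool:
        values = [r.get(column) for r in rows]
        return len(values) == len(set(values))

    return Check(f"unique:{column}", predicate)


def quality_checks(key: str, required: List[str]) -> List[Check]:
    # Reusable bundle: plays the role a task group would play in a DAG.
    return [unique(key)] + [not_null(c) for c in required]


def run_checks(rows: List[Row], checks: List[Check]) -> List[str]:
    # Returns the names of failed checks; an empty list means the batch passed.
    return [c.name for c in checks if not c.predicate(rows)]
```

In an orchestrator, each check in the bundle would presumably become its own task, so a failure surfaces on the specific check and blocks downstream tasks. The same bundle can then be attached to many pipelines by changing only the key and required columns.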
Use Cases in Production Make a Data Career (6 minute read)
Prioritizing hands-on experience with production-grade data use cases is essential for data professionals seeking rapid career growth, credibility, and leadership opportunities. The focus should be on roles or teams actively deploying, maintaining, and owning data solutions, as deploying POCs is no longer sufficient—ownership and end-to-end involvement drive value and promotions.
Postgres' Original Project Goals: The Creators Totally Nailed It (9 minute read)
Postgres was designed in the 1980s with goals like handling complex data, being extensible, supporting triggers, simplifying recovery, using new hardware, and staying true to the relational model. Decades later, PostgreSQL has achieved all of these, with features like JSONB, PostGIS, triggers, robust crash recovery, and strong performance on modern hardware.
Seven Years of Firecracker (7 minute read)
Firecracker, AWS's lightweight virtualization technology, has evolved from its Lambda roots by emphasizing simplicity, strong isolation, and fast startup through snapshotting and cloning. It now also supports Bedrock AgentCore and databases such as Aurora DSQL, where it is used to ensure session and transaction isolation.
Bytewax (GitHub Repo)
Bytewax is a Python-native, stateful stream processing framework with a Rust-based engine designed to build real-time data pipelines and applications. It supports scalable deployments, manages state automatically, and offers a flexible API for operations like map, filter, join, and windowing while integrating seamlessly with Python libraries and data sources like Kafka and filesystems.
Apache Airflow 3.1 Release Imminent (3 minute read)
Apache Airflow 3.1 introduces significant enhancements, including Human-in-the-Loop integration for manual pipeline interventions, a customizable React plugin system, and various UI improvements. The update also adds internationalization support, making it more accessible for global teams.
On Anonymization: Creating Data That Enables Generalization Without Memorization (4 minute read)
Anonymization, rather than mere privacy compliance, is emerging as the key enabler for unlocking sensitive data for safe, responsible AI and analytics. Techniques like Microsoft's Private Evolution (PE), Google's VaultGemma, and Stained Glass Transformations (SGT) enable synthetic data generation and secure inference without revealing individual records, showing how stronger anonymization improves both generalization and model utility. Enterprise adoption by Apple, Microsoft, and Google signals a shift toward models capable of provably constraining memorization and supporting data-driven innovation under robust anonymization guarantees.
Cloudflare's 2025 Annual Founders' Letter (9 minute read)
There has been a seismic shift from traditional Search Engines to AI-powered Answer Engines, with direct answers replacing web traffic as the Internet's core value exchange. This change is causing drastic declines in traffic, especially for media and research organizations, and threatens the sustainability of content-driven business models. A new ecosystem is emerging where compensation flows to creators of original, uniquely valuable data, with AI companies expected to financially support the content fueling their models.
Curated deep dives, tools and trends in big data, data science and data engineering 📊