TLDR Data 2026-05-25
Iceberg for AI, HashMap Freeze Lesson, Choosing Graph Models
Play to win with NVIDIA at Microsoft Build, online & on-site (Sponsor)
Unlock developer-first, hands-on experiences with NVIDIA agentic solutions on Azure at Microsoft Build, happening in San Francisco and online June 2-3.
From hands-on labs and demos to speaking sessions and interactive events, Microsoft Build offers developers a unique opportunity to go deep on real code, real systems, and real workflows.
New this year, online only: from June 1-5, visit the NVIDIA Builder's Arcade daily for developer challenges and the chance to score exclusive NVIDIA discounts.
The 58-Million-Key Freeze: What a HashMap Resize Taught Us About Memory Allocation at Scale (10 minute read)
LinkedIn experienced a production incident where its Rust-based FishDB service would completely freeze for 10-15 seconds, breaching availability SLOs. The root cause was a standard library HashMap resizing at exactly 58,720,256 keys, which triggered a massive memory allocation via mmap. This acquired the process-wide mmap_lock in write mode, blocking all other threads on madvise and page faults, freezing the entire async runtime.
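The key count in this incident is not arbitrary. As a rough sketch, assuming the hashbrown-style growth policy behind Rust's standard HashMap (power-of-two bucket counts and a 7/8 maximum load factor, neither of which is spelled out in the summary above), 58,720,256 is exactly the number of entries that fills a table of 2^26 buckets. The function names and the 16-byte entry size below are hypothetical, chosen only for illustration:

```python
# Sketch of the resize arithmetic. ASSUMPTION: power-of-two bucket counts
# and a 7/8 maximum load factor, as in hashbrown (Rust's std HashMap).

def capacity_for_buckets(buckets: int) -> int:
    """Entries a table with `buckets` slots (>= 8) can hold before it must grow."""
    return buckets // 8 * 7


def rough_table_bytes(buckets: int, entry_bytes: int) -> int:
    """Rough allocation size: one entry plus one control byte per bucket.

    Ignores alignment and trailing control-byte padding, so treat the
    result as an order-of-magnitude estimate only.
    """
    return buckets * (entry_bytes + 1)


old_buckets = 1 << 26          # 67,108,864 buckets
new_buckets = old_buckets * 2  # the table doubles when it grows

full_at = capacity_for_buckets(old_buckets)
print(f"table with 2^26 buckets is full at {full_at:,} entries")

# HYPOTHETICAL entry size (key + value = 16 bytes); the real size is not
# given in the summary.
new_bytes = rough_table_bytes(new_buckets, entry_bytes=16)
print(f"next table needs roughly {new_bytes / 2**30:.1f} GiB in one allocation")
```

Under these assumptions, the insert that follows the 58,720,256th entry forces a single allocation for a table twice as large, which is consistent with the large mmap call described above.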
Choosing the Right Graph (28 minute read)
RDF/OWL is better for governed, interoperable knowledge with formal meaning, reasoning, provenance, and linked-data publishing. Labeled property graphs are better for fast traversal, rich edge properties, and developer-friendly graph analytics, though RDF 1.2 narrows the gap with native statement annotations.
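To make the modelling difference concrete, here is a minimal sketch in plain Python (no graph library) of the same annotated fact in both styles. All identifiers (`ex:alice`, `WORKS_AT`, `since`, the helper functions) are hypothetical, and the RDF side only loosely mirrors the RDF 1.2 idea of a reifier node that points at a triple via `rdf:reifies`:

```python
# Labeled property graph: the edge itself carries key/value properties.
lpg_edges = [
    {"src": "alice", "label": "WORKS_AT", "dst": "acme", "props": {"since": 2021}},
]

# RDF: every fact is a (subject, predicate, object) triple. To annotate a
# statement, a reifier node refers to the triple and carries the annotation.
statement = ("ex:alice", "ex:worksAt", "ex:acme")
rdf_triples = {
    statement,
    ("_:r1", "rdf:reifies", statement),
    ("_:r1", "ex:since", 2021),
}


def lpg_since(edges, src, label, dst):
    """Read the 'since' property straight off the matching edge."""
    for edge in edges:
        if (edge["src"], edge["label"], edge["dst"]) == (src, label, dst):
            return edge["props"].get("since")
    return None


def rdf_since(triples, target):
    """Find reifiers of `target`, then look up their ex:since annotation."""
    reifiers = {s for (s, p, o) in triples if p == "rdf:reifies" and o == target}
    for (s, p, o) in triples:
        if s in reifiers and p == "ex:since":
            return o
    return None


print(lpg_since(lpg_edges, "alice", "WORKS_AT", "acme"))
print(rdf_since(rdf_triples, statement))
```

The property-graph lookup is a single hop, while the RDF lookup goes through an extra node; that indirection is the gap that statement annotations are meant to narrow.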
The Hugo evolution: Engineering Grab's unified, one-click data ingestion platform with Apache Flink (4 minute read)
Grab unified its self-service data ingestion into one automated Flink-based workflow for RDS CDC and Kafka pipelines. The new platform cuts onboarding from days to minutes, reduces schema and governance issues early, and lowers operational overhead.
From Batch to Streaming and AI, Iceberg for Everyone by Everyone (34 minute video)
Apache Iceberg has seen strong success from batch analytics in v1 through to the recent v3 table spec, which added vendor-neutral support for semi-structured data and improved deletes. The format still requires significant enhancements for low-latency streaming and AI workloads, so the community is working on v4 to support One File Commits, better column statistics, and columnar metrics, with the goal of making Iceberg truly universal.
Plan Mode All the Time, Substrait over SQL, and the End of the DE Role ft (15 minute read)
AI is already strong enough to handle much of data engineering, especially with declarative workflows and strong quality gates. Use "plan mode," fresh context resets, and external tests to manage LLM non-determinism. A Substrait-like format may be more appropriate than SQL for agents to express transformations, as it conveys physical operations. Data engineering may blur into a broader "data" role as agent ergonomics start to matter more than human ergonomics.
Of Hammers and Nails: What AI Can and Cannot Do for a Data Analyst (6 minute read)
AI helps data analysts write code, prep data, and draft analysis faster, but it is still too inconsistent for trusted ad hoc answers. Good analysis still needs clean data, context, judgment, and human knowledge.
DuckDB 1.5.3: Not an Ordinary Patch Release (4 minute read)
DuckDB v1.5.3 is framed as a patch release, but it adds major extension-driven features, including Quack as a core beta extension, DuckLake support for Quack, and new AWS, HTTPS proxy, and Iceberg capabilities. It also includes internal packaging and security-related fixes, with Quack expected to become production-ready alongside DuckDB v2.0 in fall 2026.
Introducing Dimster, a performance benchmarking tool for Apache Kafka (13 minute read)
Dimster is an open-source Kafka benchmarking tool that makes it easier to test performance across different workloads and configs. It supports throughput, peak-rate, backlog-drain, and correctness tests, with results shown in charts and Grafana dashboards.
pg_infer 1.0.0 released -- transformer model knowledge as SQL relations (4 minute read)
pg_infer is a PostgreSQL 18+ extension that makes transformer internals queryable in SQL, so model inference can be costed, parallelized, joined, and filtered like regular database work. It runs efficiently on CPU, supports BitNet models, and can offload inference to replicas or DR hosts.
Same buffers, same instructions, same hardware. Where Is the JVM Tax? (17 minute read)
To test the common claim that the JVM imposes a significant performance penalty on analytical workloads, this post benchmarks simple vectorized arithmetic kernels running directly over Apache Arrow buffers in pure Java against native arrow-rs. The results show comparable performance, indicating that on the same columnar memory layout and hardware, a warmed-up JVM does not impose any mysterious "tax" on raw compute kernels.
SAM 3: Segment Anything with Concepts (GitHub Repo)
SAM 3 is Meta's new image and video segmentation model that can find and track open-vocabulary concepts from text or visual prompts. It improves on SAM 2 with broader concept coverage, a new architecture, and strong SA-Co benchmark results.
Curated deep dives, tools and trends in big data, data science and data engineering