TLDR Data 2026-01-12
Secure RAG Access 🔐, Iceberg Lacks Operational Guarantees ⚠️, Spark Declarative Pipelines ⚙️
RAG with Access Control (7 minute read)
Relationship-Based Access Control (ReBAC) models data access as a graph of relationships between users and resources, providing flexibility for dynamic and context-rich applications. SpiceDB, an open-source Zanzibar-based ReBAC implementation, can be integrated with vector databases like Pinecone to secure RAG pipelines with fine-grained policies. OpenAI uses SpiceDB to enforce access policies over 37 billion documents across 5 million ChatGPT Connector users, effectively preventing information leakage when serving domain-specific knowledge.
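To make the idea concrete, here is a minimal sketch of relationship-based filtering applied to retrieval hits before they reach the prompt. It is a toy in-memory model, not SpiceDB's or Pinecone's API: the `RelationGraph` class, the `viewer` relation, and all document and user names are hypothetical.

```python
from collections import defaultdict


class RelationGraph:
    """Toy in-memory relationship store (hypothetical; not SpiceDB's API)."""

    def __init__(self):
        # (resource, relation) -> subjects; a subject is a user or a group
        self._tuples = defaultdict(set)
        # group -> member users
        self._members = defaultdict(set)

    def add_member(self, group, user):
        self._members[group].add(user)

    def grant(self, resource, relation, subject):
        self._tuples[(resource, relation)].add(subject)

    def can_view(self, user, resource):
        """True if the user is a viewer directly or through group membership."""
        subjects = self._tuples.get((resource, "viewer"), set())
        if user in subjects:
            return True
        return any(user in self._members.get(s, set()) for s in subjects)


def filter_authorized(graph, user, hits):
    """Drop retrieved chunks the user may not view before building the prompt."""
    return [h for h in hits if graph.can_view(user, h["doc_id"])]


graph = RelationGraph()
graph.add_member("group:finance", "user:alice")
graph.grant("doc:q4-forecast", "viewer", "group:finance")
graph.grant("doc:handbook", "viewer", "user:alice")
graph.grant("doc:handbook", "viewer", "user:bob")

# Stand-in for vector search results; scores are made up.
hits = [
    {"doc_id": "doc:q4-forecast", "score": 0.91},
    {"doc_id": "doc:handbook", "score": 0.84},
]

alice_docs = [h["doc_id"] for h in filter_authorized(graph, "user:alice", hits)]
bob_docs = [h["doc_id"] for h in filter_authorized(graph, "user:bob", hits)]
print(alice_docs, bob_docs)
```

Filtering after retrieval is only one possible placement; a real deployment might instead push the permitted document IDs into the vector query as a metadata filter.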
The Journey to Zero-Copy: How chDB Became the Fastest SQL Engine on Pandas DataFrame (11 minute read)
chDB is a Python library that embeds ClickHouse so you can run high-performance SQL directly on Pandas DataFrames without serialization overhead. It achieves true zero-copy input by automatically discovering DataFrames and wrapping their memory directly. Critical components were rewritten in C++ to avoid Python performance limits like the GIL and string encoding overhead. In v4.0, it completed the loop with zero-copy output using direct NumPy type mapping and shared memory buffers.
Apache Hudi 1.1 Deep Dive: Async Instant Time Generation for Flink Writers (11 minute read)
Apache Hudi 1.1 introduces asynchronous instant time generation for Flink writers, which lets a writer request and receive a new instant time before the previous instant is fully committed. This removes the blocking delay between data flushing and checkpoint completion that previously caused throughput fluctuations, backpressure, and ingestion instability.
Beyond One-Size-Fits-All RAG: Why Different Knowledge Sources Need Different Retrieval Strategies (12 minute read)
Stop treating RAG as one vector-search pipeline: different knowledge sources demand different retrieval contracts, or your quality/latency/cost will implode in production. Use source-specific strategies: contextualized chunk embeddings for long docs, doc-level summaries + reranking for content discovery, and hybrid keyword+semantic retrieval with rule-type adjudication for compliance. The trade-off is more ingest logic and ops complexity, but you win by batching enrichment, caching judgments, and validating/grounding outputs so they can't cite beyond retrieved evidence.
Data Trust is Death by a Thousand Paper Cuts (8 minute read)
Small failures accumulate across the clickstream data lifecycle: instrumentation drift, unvalidated developer changes, bot traffic inflation, ad-blocker suppression, pipeline downtime, and misconfigurations. Together they can create significant discrepancies (e.g., 18% fewer sign-ups in analytics than in the backend database) that erode data trust, leading to unreliable analytics, delayed decisions, and risks for AI models.
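A simple reconciliation check makes this kind of gap visible early. The sketch below computes the relative shortfall of analytics against the backend and flags it against a tolerance. The counts are hypothetical, picked only to reproduce an 18% gap, and the 5% tolerance is an assumption.

```python
def relative_shortfall(analytics_count, backend_count):
    """Fraction of backend events that are missing from analytics."""
    if backend_count <= 0:
        raise ValueError("backend_count must be positive")
    return (backend_count - analytics_count) / backend_count


def exceeds_tolerance(analytics_count, backend_count, tolerance=0.05):
    """True when the absolute gap is larger than the agreed tolerance."""
    return abs(relative_shortfall(analytics_count, backend_count)) > tolerance


# Hypothetical daily sign-up counts.
gap = relative_shortfall(analytics_count=820, backend_count=1000)
alert = exceeds_tolerance(analytics_count=820, backend_count=1000)
print(f"shortfall={gap:.0%}, alert={alert}")
```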
LLM Predictions for 2026, Shared with Oxide and Friends (5 minute read)
The author predicts that in 2026 LLM-generated code quality will become undeniably excellent, effective sandboxing solutions will finally emerge to safely run untrusted code, and a major security incident (a "Challenger disaster") will expose the risks of over-privileged coding agents. Looking further ahead, within three years it should be clear how the Jevons paradox plays out for software engineering.
A Critique of Iceberg REST Catalog: A Classic Case of Why Semantic Spec Fails (5 minute read)
Apache Iceberg's REST Catalog specification ensures semantic interoperability, enabling diverse engines such as Trino, Spark, and Flink to interact seamlessly via a universal API. However, the standard omits operational guarantees (no defined latency, throughput, or synchronization SLAs), leading to unpredictable performance, high retry amplification, and systemic instability as table and catalog counts scale. This lack of operational constraints shifts the burden to clients and operators, making systems fragile and hard to maintain, highlighting the need for explicit behavioral contracts and conformance testing to ensure reliability at enterprise scale.
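To see why retry amplification matters, consider the arithmetic: if each layer in a call chain makes up to `a` attempts, the bottom layer can see `a ** n` attempts for one user request across `n` layers. The sketch below shows that calculation plus a capped, jittered backoff schedule a client might adopt on its own. Both are illustrations under assumed parameters, not behavior defined by the Iceberg REST spec.

```python
import random


def worst_case_attempts(attempts_per_layer, layers):
    """Upper bound on calls reaching the last layer when every layer retries."""
    return attempts_per_layer ** layers


def backoff_delays(max_attempts, base=0.2, cap=5.0, rng=None):
    """Full-jitter exponential backoff: one delay before each retry, in seconds."""
    rng = rng or random.Random()
    return [
        rng.uniform(0, min(cap, base * 2 ** i))
        for i in range(max_attempts - 1)
    ]


# Engine -> catalog client -> catalog server, 3 attempts each (hypothetical).
amplification = worst_case_attempts(attempts_per_layer=3, layers=3)
delays = backoff_delays(max_attempts=4, rng=random.Random(0))
print(amplification, [round(d, 3) for d in delays])
```

Without a contract that says who retries and how often, every layer tends to add its own retry loop, which is how a brief catalog slowdown turns into a sustained load spike.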
OpenEverest, a Tool To Manage Multiple Databases on Kubernetes (3 minute read)
Percona has donated OpenEverest, a unified Kubernetes-native database management tool that supports PostgreSQL, MySQL, and MongoDB, to the CNCF under Apache 2.0 open source licensing. OpenEverest enables vendor-agnostic provisioning, high availability, disaster recovery, and autoscaling through standardized CRDs and RESTful APIs, simplifying database ops without requiring database-specific expertise.
Spark Declarative Pipelines Programming Guide (6 minute read)
Spark Declarative Pipelines is a declarative framework for building reliable batch and streaming data pipelines on Spark, where you define what tables and transformations should exist, and the system automatically handles orchestration, dependencies, and execution. It supports both SQL and Python APIs, along with a CLI, to make ETL development simpler and more maintainable.
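The declarative idea can be illustrated without Spark. In the toy sketch below, each function declares a table and the tables it reads from, and a small runner works out the execution order with a topological sort. This is not the Spark Declarative Pipelines API; the `Pipeline` class, the `table` decorator, and the sample data are all hypothetical.

```python
from graphlib import TopologicalSorter


class Pipeline:
    """Toy declarative runner: you declare tables, it resolves the order."""

    def __init__(self):
        self._tables = {}  # table name -> (dependency names, function)

    def table(self, *depends_on):
        def register(fn):
            self._tables[fn.__name__] = (depends_on, fn)
            return fn
        return register

    def run(self):
        graph = {name: set(deps) for name, (deps, _) in self._tables.items()}
        order = list(TopologicalSorter(graph).static_order())
        results = {}
        for name in order:
            deps, fn = self._tables[name]
            results[name] = fn(*(results[d] for d in deps))
        return order, results


pipeline = Pipeline()


# Declared out of order on purpose; the runner sorts out dependencies.
@pipeline.table("clean_orders")
def daily_revenue(clean_orders):
    totals = {}
    for order_row in clean_orders:
        day = order_row["day"]
        totals[day] = totals.get(day, 0) + order_row["amount"]
    return totals


@pipeline.table("raw_orders")
def clean_orders(raw_orders):
    return [row for row in raw_orders if row["amount"] > 0]


@pipeline.table()
def raw_orders():
    return [
        {"day": "mon", "amount": 10},
        {"day": "mon", "amount": -5},
        {"day": "tue", "amount": 7},
    ]


order, results = pipeline.run()
print(order, results["daily_revenue"])
```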
Supercharging LLMs: Scalable RL with torchforge and Weaver (4 minute read)
Meta's torchforge is a PyTorch library for training LLMs with reinforcement learning that can scale to hundreds of GPUs. It hides most of the infrastructure complexity and supports fast experimentation. Paired with Weaver, a verifier system, it cuts compute costs while improving accuracy across math, science, and reasoning tasks.
Introducing MCP CLI: A way to call MCP Servers Efficiently (7 minute read)
MCP-CLI is a lightweight, open-source command-line tool that enables efficient, dynamic interaction with Model Context Protocol (MCP) servers. By supporting just-in-time tool discovery and execution instead of statically loading all tool definitions, it drastically reduces token consumption (up to 99% savings), making it ideal for AI coding agents like Gemini CLI or Claude Code.
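The saving comes from what ends up in the model's context. The sketch below compares loading every tool definition up front with listing tool names and fetching one schema on demand. It does not call MCP-CLI or any MCP server: the catalog is generated, the tool names are made up, and character count is used as a rough stand-in for tokens.

```python
import json


def make_catalog(n_tools=50):
    """Build a hypothetical catalog of tool definitions (all names made up)."""
    return {
        f"tool_{i:02d}": {
            "description": f"Hypothetical tool number {i} that does something useful.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Free-text input"},
                    "limit": {"type": "integer", "description": "Max results"},
                },
                "required": ["query"],
            },
        }
        for i in range(n_tools)
    }


def static_context(catalog):
    """Static loading: every full definition goes into the prompt up front."""
    return json.dumps(catalog)


def just_in_time_context(catalog, needed):
    """Just-in-time: list names only, then fetch the one schema actually used."""
    index = json.dumps(sorted(catalog))
    detail = json.dumps({needed: catalog[needed]})
    return index + detail


catalog = make_catalog()
static_chars = len(static_context(catalog))
jit_chars = len(just_in_time_context(catalog, "tool_07"))
savings = 1 - jit_chars / static_chars
print(f"static={static_chars} chars, just-in-time={jit_chars} chars, saved={savings:.0%}")
```

The gap widens as the catalog grows or as schemas get longer, since the static cost scales with every definition while the just-in-time cost scales mostly with the names.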
Databricks x Palantir | Partnership Deep Dive (16 minute video)
Databricks and Palantir are building a two-way integration so teams can access the same data from either platform without copying it. The integration centers on data federation, governance, compute pushdown, and workflow and model sharing.
Curated deep dives, tools and trends in big data, data science and data engineering 📊