TLDR Data 2026-05-28
Slashing Snowflake Costs ❄️, Open-Source Agent Tradeoffs 🤖, Kafka’s New Bottleneck ⚙️
Kafka Share Groups and Parallelizing Consumption — Tuning max.poll.records (14 minute read)
With Kafka Share Groups, the main bottleneck shifts from partition count to the combination of max.record.locks and max.poll.records. The default of 500 is often too high and causes “greedy capture” (a few consumers hog large batches). The recommended setting is roughly max.record.locks / consumers-per-partition (then tune slightly lower) for stable, high throughput.
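As a rough illustration of that rule of thumb, the arithmetic can be sketched in a few lines. The function name and the `headroom_pct` parameter are assumptions made here to represent "tune slightly lower"; they do not come from the article or from Kafka itself, and the right headroom depends on the workload.

```python
def suggested_max_poll_records(
    max_record_locks: int,
    consumers_per_partition: int,
    headroom_pct: int = 90,  # hypothetical: keep 90% of the fair share
) -> int:
    """Starting point for max.poll.records under the summary's rule of thumb.

    Fair share = max.record.locks / consumers-per-partition, then scaled
    down slightly so no single consumer can capture every available lock.
    """
    if max_record_locks <= 0 or consumers_per_partition <= 0:
        raise ValueError("both inputs must be positive")
    if not 0 < headroom_pct <= 100:
        raise ValueError("headroom_pct must be in (0, 100]")
    # Integer arithmetic avoids floating-point rounding surprises.
    scaled = (max_record_locks * headroom_pct) // (consumers_per_partition * 100)
    return max(1, scaled)


if __name__ == "__main__":
    # Illustrative inputs only: 2000 record locks, 8 consumers per partition.
    print(suggested_max_poll_records(2000, 8))
```

The result is only a starting value; the summary's advice is to measure and adjust from there.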
How CockroachDB Built Vector Indexing at Scale (8 minute read)
CockroachDB built its own vector indexing system called C-SPANN to support scalable vector search because existing approaches like HNSW and IVF didn't fit its distributed architecture. C-SPANN uses a hierarchical K-means tree stored as regular table data, supports real-time inserts and deletes, and integrates natively with CockroachDB's sharding and rebalancing.
Design S3 Object Storage Like a Senior Engineer (31 minute read)
S3-scale object storage hinges on a flat, immutable namespace: buckets hold objects identified by keys, while metadata is separated from payload bytes so the two can scale independently. At ~100PB and hundreds of millions of objects, the design requires distributed metadata sharding, merged on-disk segment files to avoid inode exhaustion, and chunking of large objects for parallel reads and range requests.
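To make the chunking idea concrete, here is a minimal sketch of how a large object can be split into byte ranges that independent readers fetch in parallel. The function name and the 8 MiB chunk size are illustrative assumptions, not values from the article; the inclusive `(start, end)` pairs follow the convention of HTTP `Range: bytes=start-end` headers.

```python
def chunk_ranges(object_size: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split an object of object_size bytes into inclusive (start, end) ranges."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size > 0")
    ranges = []
    for start in range(0, object_size, chunk_size):
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size, object_size) - 1
        ranges.append((start, end))
    return ranges


if __name__ == "__main__":
    # Hypothetical 20 MiB object read in 8 MiB chunks.
    for start, end in chunk_ranges(20 * 1024 * 1024, 8 * 1024 * 1024):
        print(f"Range: bytes={start}-{end}")
```

Each range can be handed to a separate worker, and a client asking for a sub-range only needs the chunks that overlap it.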
I battletested 5 open source analytics agents (14 minute read)
Open-source “analytics agents” are often grouped together, but LangChain, Wren AI, nao, LibreChat, and Vercel's template solve very different problems, and only some are actually built for analytics. Reliable answers depend less on the agent interface and more on where business context lives, whether that's prompts, semantic models, markdown files, or the underlying MCP/tooling layer.
I Inherited a $140K Snowflake Bill — Three Months Later It Was $38K. Here's Everything I Learned (23 minute read)
Snowflake cost and performance hinge on three separable layers: storage, compute, and cloud services, with the biggest savings coming from right-sizing warehouses, aggressive auto-suspend, and reducing storage bloat from retention settings. The strongest optimization levers are physical data layout and query design: use clustering only when query predicates match the clustering; avoid SELECT *, function-wrapped filters, and full reloads; and prefer incremental pipelines and pre-aggregation before joins.
AI Risk Is an Architecture Problem (20 minute read)
AI risk should be assessed at the system level, not just the model level. The three mechanism risks of data exposure, incorrect output, and unintended action map to five business harms: brand, compliance, liability, operational, and commercial risk. The most important control is architecture: what the AI can see, what its output feeds into, and what it can do without checks. Adding human review, deterministic validations, and bounded permissions can sharply reduce action risk without changing the model.
2026 State of Analytics Engineering Report by dbt Labs (Sponsor)
AI is speeding up analytics work, but the fundamentals still decide whether anyone trusts the output. dbt Labs' 2026 State of Analytics Engineering Report looks at AI-assisted coding, governance gaps, infrastructure costs, and the growing pressure to deliver reliable insights faster.
Scaling AI-Driven Marketing Processes with PostgreSQL (6 minute read)
Marketing teams can scale AI workflows reliably by using PostgreSQL as their central data layer: managing workflow state with ENUMs, combining relational tables with JSONB for flexibility, connecting campaign, asset, and performance data, and using full-text search and pgvector for semantic context.
RushDB 2.0: Memory Infrastructure for the Agentic Era (11 minute read)
RushDB 2.0 is an agent memory infrastructure that combines graph storage, semantic search, ontology/schema discovery, MCP access, skills, analytics queries, and BYO Neo4j into one layer. Agents need structured memory and reliable context, not a separate vector store, graph DB, and schema-discovery workflow stitched together manually.
Auditing Model Bias with Balanced Datasets with Mimesis (7 minute read)
The Mimesis library can create synthetic, balanced counterfactual datasets to test whether a model contains hidden bias along attributes such as gender, age, or ethnicity, while keeping other features consistent. This helps teams measure prediction changes and detect unwanted bias in a safe, privacy-preserving way.
MurrDB (GitHub Repo)
MurrDB is a fast NVMe/S3-backed serving cache for ML/AI inference built for batch reads/writes over large tabular data without keeping everything in RAM. It is a cheaper, lower-latency alternative to Redis for feature and document-attribute retrieval, not a general-purpose database.
Deconstructing Data Sketches (8 minute read)
Data sketches estimate expensive metrics like distinct counts by storing a small probabilistic sample, such as the lowest K hashed values, instead of scanning every row. They trade perfect accuracy for huge speed and compute savings, making them useful for large-scale dashboards, reports, and distributed aggregation.
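The "lowest K hashed values" idea can be illustrated with a small K-minimum-values (KMV) sketch. This is a simplified illustration under stated assumptions, not the article's implementation: the function names, the choice of SHA-1 as the hash, and the default K are all picked here for the example. The estimator used is the standard KMV form, (K - 1) divided by the K-th smallest hash once hashes are normalized to [0, 1).

```python
import hashlib
import heapq


def _hash01(value: str) -> float:
    """Map a value to a pseudo-uniform number in [0, 1) via SHA-1."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(values, k: int = 1024) -> list[float]:
    """Return the k smallest distinct hash values seen, in ascending order."""
    kept = set()  # hashes currently in the sketch
    heap = []     # negated hashes, so heap[0] is the largest kept hash
    for v in values:
        h = _hash01(str(v))
        if h in kept:
            continue  # duplicates never change the sketch
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the current largest kept hash with the smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    return sorted(kept)


def estimate_distinct(sketch: list[float], k: int = 1024) -> float:
    """Estimate the distinct count from a KMV sketch."""
    if len(sketch) < k:
        return float(len(sketch))  # sketch never filled: count is exact
    return (k - 1) / sketch[-1]


if __name__ == "__main__":
    sketch = kmv_sketch((f"user-{i % 50_000}" for i in range(200_000)), k=1024)
    print(round(estimate_distinct(sketch, k=1024)))
```

The sketch stores at most K numbers regardless of input size, which is where the speed and memory savings come from. A larger K tightens the estimate at the cost of more storage.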
Open Data Product SDK: Turning Data Product Ideas Into Standard YAML With AI Models (5 minute read)
Open Data Product SDK now supports AI-assisted conversion of free-form text and Markdown into standards-ready YAML for data product catalogs, item-level specs, and ODPG graph context. The workflow captures product descriptions, use cases, business objectives, and signals, then generates ODPC Catalog YAML and connected portfolio metadata. The goal is to replace manual metadata editing with a standards-first path from stakeholder language to machine-readable data product definitions.
Announcing Polars 1.41 (2 minute read)
Polars 1.41 delivers three practical gains for analytical workloads: faster Parquet footer decoding for wide tables, deeper common subplan elimination across nested query branches, and new LazyFrame.gather() support for integer-based row selection without materializing data.
Visualize the Brrr (Website)
GPUs are the hidden engines driving today's AI revolution, but most developers treat them as mysterious, costly accelerators.
Curated deep dives, tools and trends in big data, data science and data engineering 📊