TLDR Data 2025-09-11
DuckDB npm Attack, 2025 Data Eng Trends, Kestra Orchestration 1.0 Release
Operationalizing first-party data (Sponsor)
Not all data is created equal. For advertisers, first-party data is often the most valuable asset - as it's accurate, reliable, and comes without privacy or compliance concerns.
Download this guide by OneTrust to learn:
- How to activate first-party data by unifying customer profiles
- How to use data to power privacy-first advertising campaigns
- How to deliver better targeting and richer personalization
Get your copy here
Peeking Inside the SQL Server Transaction Log (9 minute read)
SQL Server's change data capture currently relies on system change tables populated by the SQL Server Agent, which Debezium polls at configurable intervals to stream CDC events. Direct parsing of the SQL Server transaction log, mirroring Oracle CDC approaches, could reduce latency and increase efficiency. This article details the physical storage architecture of transaction logs (virtual log files, blocks, and LSNs), data files, partitions, and data pages, with practical walkthroughs for low-level analysis using system views and DBCC utilities.
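For readers new to LSNs, here is a minimal Python sketch, not taken from the article, of how a log sequence number breaks down. It assumes the colon-separated hexadecimal form that SQL Server log inspection tools commonly print (VLF sequence number, log block offset, slot number). The helper name `parse_lsn` and the sample values are hypothetical.

```python
from typing import NamedTuple


class LSN(NamedTuple):
    """Three-part log sequence number (field names are illustrative)."""
    vlf_sequence: int   # which virtual log file the record lives in
    block_offset: int   # offset of the log block inside that VLF
    slot: int           # position of the record inside the log block


def parse_lsn(text: str) -> LSN:
    """Parse an LSN written as three colon-separated hex fields.

    Assumes input such as "00000025:00000148:0001"; raises ValueError
    if the string does not contain exactly three hex parts.
    """
    vlf, block, slot = (int(part, 16) for part in text.split(":"))
    return LSN(vlf, block, slot)


# Made-up sample values, used only to show ordering.
earlier = parse_lsn("00000025:00000148:0001")
later = parse_lsn("00000025:00000150:0001")
print(earlier)          # LSN(vlf_sequence=37, block_offset=328, slot=1)
print(earlier < later)  # True
```

Because the parsed value is a tuple, comparing two LSNs orders them by VLF first, then block, then slot, which is the property a log-based CDC reader would rely on to resume from a saved position.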
Past Year's Data Engineering and Current Trends (2025 edition) (7 minute read)
Key trends in modern data engineering include memory-first analytical caching (e.g., DuckDB, Druid, and BigQuery BI Engine) for sub-second dashboard performance and cost savings, democratization of Semantic Layers (e.g., dbt Semantic Layer, Cube, and Looker) to enforce metric consistency and prevent analytic drift, and portable query optimization frameworks (e.g., Calcite, Substrait, and DataFusion) that decouple business logic from execution. AI-powered interfaces and automation enhance self-serve analytics, data quality, and documentation. While the last decade featured a fragmentation of the stack into a myriad of tools, the momentum has shifted back to platform consolidation.
TimescaleDB to ClickHouse Replication: Use Cases, Features, and How We Built It (6 minute read)
The ClickPipes Postgres CDC connector, powered by PeerDB, enables efficient replication from TimescaleDB to ClickHouse Cloud, supporting fast parallel loads, schema changes, and comprehensive monitoring for both compressed and uncompressed hypertables. It overcomes challenges like chunk-level replication and compression by using automated parent lookups and a CTID-agnostic fallback for reliable data transfer.
Is Data Modeling Dead? (4 minute read)
The rise of big data, NoSQL databases, and cloud-native solutions has shifted focus from data modeling. Flexible schemas in NoSQL and tools like Spark or Hadoop prioritize scalability over rigid structures. However, data modeling remains valuable for specific use cases, particularly where data governance, compliance, or analytics are priorities.
Will AI Permanently Disrupt the Bundling and Unbundling Cycle? (34 minute podcast)
The data industry naturally cycles through bundling and unbundling, with major moves like Fivetran's acquisitions pushing toward bundled solutions. Despite optimism that AI will create streamlined solutions, unbundling remains highly probable, as AI often delivers functional but less efficient outcomes and larger organizations are slower to align complex product goals with enterprise customer priorities.
SCD2 Deep Dive with dlt: How Nested Data Affects Queries and Costs (5 minute read)
Nested SCD2 is hard to manage with JSON, but the dlt library automates it by flattening data, tracking validity (valid_from/valid_to), and generating SQL across root and nested tables. Key tips: let dlt handle versioning, use incremental loads to cut query cost by 25-35%, and note that nesting depth only modestly affects performance. Focus on schema design and extraction strategy rather than flattening depth to keep pipelines simple and efficient.
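To make the valid_from/valid_to idea concrete, here is a minimal pure-Python sketch of SCD2 versioning on a flat record. It is not dlt's implementation and does not use the dlt API. The `Version` class and `apply_snapshot` helper are hypothetical, and the sketch assumes each load is a full snapshot in which a missing key means the record was deleted.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Version:
    key: str
    attrs: dict
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None marks the current version


def apply_snapshot(history: list, snapshot: dict, now: datetime) -> list:
    """Merge a full snapshot {key: attrs} into an SCD2 history (hypothetical helper)."""
    # Index the currently active version of each key.
    current = {v.key: v for v in history if v.valid_to is None}

    for key, attrs in snapshot.items():
        active = current.get(key)
        if active is not None and active.attrs == attrs:
            continue  # unchanged: keep the active version open
        if active is not None:
            active.valid_to = now  # changed: retire the old version
        history.append(Version(key, attrs, valid_from=now))

    # Assumption: keys absent from the snapshot were deleted upstream.
    for key, active in current.items():
        if key not in snapshot:
            active.valid_to = now
    return history


t1, t2 = datetime(2025, 1, 1), datetime(2025, 2, 1)
history = apply_snapshot([], {"cust-1": {"city": "Berlin"}}, t1)
history = apply_snapshot(history, {"cust-1": {"city": "Paris"}}, t2)
for v in history:
    print(v.key, v.attrs, v.valid_from.date(), v.valid_to)
```

After the second load the Berlin row is closed at `t2` and the Paris row is open-ended, so a point-in-time query only needs a range filter on the two validity columns.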
Unify Analytics & AI: Free Builder Workshops (Sponsor)
Learn to build AI-ready data foundations, integrate ML workflows, and implement architectural patterns for unified operations. Perfect for data teams looking to drive smarter decision-making through unified data and AI workflows. No vendor pitches - just practical insights for data leaders and practitioners.
Save your spot today
Can Collations Be Used Over citext? (6 minute read)
Custom nondeterministic collations outperform citext by 2-4x for case-insensitive equality and range queries, offering better performance and simpler semantics, especially in sequential scans. However, for LIKE queries, citext remains preferable due to its support for indexed pattern matching, as collations lack index optimization in PostgreSQL 18 and earlier.
DuckDB npm Packages Compromised (2 minute read)
Multiple DuckDB npm packages were compromised with malicious updates designed to drain crypto wallets on September 9. The attack reused a known payload and oddly targeted backend libraries, and the vendor responded by deprecating the affected releases.
Kestra 1.0 - Declarative Orchestration with AI Agents and Copilot (17 minute read)
Kestra 1.0 makes orchestration more powerful and easier to use, especially for data engineers. The AI Copilot turns natural language into YAML flows, and AI Agents can autonomously decide and loop tasks to meet goals. Key updates include AI-powered doc search, Git sync for full backups, plugin versioning, and unit tests to validate flows. Playground mode enables quick task-by-task prototyping, while flow-level SLAs improve reliability. New Helm charts simplify both testing and production deployments. Overall, Kestra 1.0 adds flexibility, automation, and better tooling for building and managing data pipelines.
LLM Query Performance Testing (GitHub Repo)
LLM Query Performance Testing provides a benchmark that measures how database performance impacts LLM-style chat interactions, emphasizing user experience over raw speed. It compares ClickHouse and PostgreSQL across datasets from 10k to 10M rows, with tools for bulk testing, latency simulation, and performance visualization.
The SELECT FOR UPDATE Trap Everyone Falls Into (7 minute read)
Overusing SELECT FOR UPDATE in PostgreSQL can severely degrade concurrency, causing cascading locks, deadlocks, and performance bottlenecks, particularly when handling foreign key relationships and concurrent transactions. Empirical evidence shows switching to FOR NO KEY UPDATE eliminates 70% of lock wait times and triples throughput, provided rows are not being deleted or primary keys changed. Align lock levels with actual update intent to maximize transaction performance and system stability.
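Why the weaker lock helps can be sketched with PostgreSQL's documented row-level lock conflict table, encoded below in Python. This is an illustration only, not code from the article; the `blocks` helper is hypothetical. The relevant detail is that a foreign key check on a child insert takes a FOR KEY SHARE lock on the referenced parent row.

```python
# Row-level lock modes in PostgreSQL, weakest to strongest, mapped to the
# modes they conflict with (per the row-level lock table in the PostgreSQL docs).
CONFLICTS = {
    "FOR KEY SHARE": {"FOR UPDATE"},
    "FOR SHARE": {"FOR NO KEY UPDATE", "FOR UPDATE"},
    "FOR NO KEY UPDATE": {"FOR SHARE", "FOR NO KEY UPDATE", "FOR UPDATE"},
    "FOR UPDATE": {"FOR KEY SHARE", "FOR SHARE", "FOR NO KEY UPDATE", "FOR UPDATE"},
}


def blocks(held: str, requested: str) -> bool:
    """Return True if a row lock already held makes the requested lock wait."""
    return requested in CONFLICTS[held]


# A child insert's foreign key check requests FOR KEY SHARE on the parent row.
fk_check = "FOR KEY SHARE"
print(blocks("FOR UPDATE", fk_check))         # True: child inserts queue up
print(blocks("FOR NO KEY UPDATE", fk_check))  # False: child inserts proceed
```

Two writers that both take FOR NO KEY UPDATE on the same row still serialize against each other, so the gain comes from no longer blocking foreign key checks on referencing tables.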
Big Data on the Move: DuckDB on the Framework Laptop 13 (5 minute read)
DuckDB demonstrates impressive performance on a Framework Laptop 13 with 128 GB RAM, loading 20 GB of CSV files in just 10.2 seconds and processing TPC-H queries on a 3 TB dataset in 47.5 minutes. The results highlight the potential of modern laptops for handling substantial analytical workloads, although thermal management remains a concern during intensive tasks.
Curated deep dives, tools and trends in big data, data science and data engineering
Join 400,000 readers for one daily email