TLDR Data 2025-06-16
Unified Data Architecture, dbt Fusion Containers, DuckDB Text Analytics
Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix (15 minute read)
Netflix's Unified Data Architecture (UDA) is a knowledge graph-based system that enables consistent modeling of business concepts across various platforms, enhancing automation and data interoperability. It uses RDF and SHACL to create a unified data catalog and schema registry, connecting domain models to data containers like GraphQL and Iceberg tables.
Lightweight Text Analytics Workflows with DuckDB (9 minute read)
DuckDB streamlines advanced text analytics by integrating keyword, full-text, and semantic search directly within a high-performance SQL environment. Combining its experimental FTS and vector similarity extensions with direct access to over 150,000 Hugging Face datasets, it supports efficient text tokenization, stopword management, embedding generation (stored as FLOAT[384] arrays), and hybrid search scoring that blends BM25 with cosine similarity.
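As a rough illustration of the hybrid scoring idea, the sketch below blends a BM25 keyword score with cosine similarity in plain Python. It does not use DuckDB or its extensions; the function names, the `alpha` weight, the max-normalization of BM25, and the BM25 parameters `k1` and `b` are assumptions chosen for the example, not details from the article.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of each tokenized document against a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def hybrid_scores(query_tokens, query_vec, docs, doc_vecs, alpha=0.5):
    """Weighted blend of max-normalized BM25 and cosine similarity (alpha is hypothetical)."""
    bm25 = bm25_scores(query_tokens, docs)
    top = max(bm25) or 1.0  # avoid dividing by zero when no document matches
    return [
        alpha * (s / top) + (1 - alpha) * cosine(query_vec, v)
        for s, v in zip(bm25, doc_vecs)
    ]
```

A real workflow would use 384-dimensional embeddings from a model rather than hand-written vectors, and the right value of `alpha` depends on the dataset.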
Unlocking Efficient Ad Retrieval: Offline Approximate Nearest Neighbors in Pinterest Ads (6 minute read)
Pinterest developed an offline approximate nearest neighbors (ANN) system to enhance ad retrieval efficiency, enabling faster and more relevant ad delivery by processing large-scale ad corpora. The system leverages a two-tower model and hierarchical navigable small world graphs to perform sub-linear searches, significantly reducing computational costs.
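For intuition, the exact search that an ANN index approximates can be written in a few lines. The sketch below is a brute-force baseline, not HNSW and not Pinterest's system: it scores every ad embedding against a query embedding by inner product, which is linear in corpus size. The function name and the toy two-dimensional embeddings are hypothetical.

```python
import heapq


def top_k_ads(query_vec, ad_vecs, k=2):
    """Exact top-k ad IDs by inner product; an ANN index trades some recall to avoid this full scan."""
    scored = (
        (sum(q * a for q, a in zip(query_vec, vec)), ad_id)
        for ad_id, vec in ad_vecs.items()
    )
    return [ad_id for _, ad_id in heapq.nlargest(k, scored)]
```

In a two-tower setup, `query_vec` would come from the query tower and `ad_vecs` from the ad tower; a graph index such as HNSW replaces the full scan with a guided walk over neighbor links.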
A Platform-centric Approach to AI-assisted Code Generation at Intuit (7 minute read)
Intuit's platform-centric approach leverages a coding assistant aware of Intuit's context to enhance developer productivity for products like TurboTax and QuickBooks. By relying on golden repositories (curated, high-quality code and data sources), the system ensures contextually relevant, consistent, and compliant AI-assisted coding.
Data, AI, Market Consolidation, Platform Wars, and the Cost of Governance Silos (2 minute read)
Recent major acquisitions (Salesforce-Informatica, Databricks-Neon, and Snowflake-Crunchy Data) signal intensifying competition for AI and data workloads, with both AI advances and cloud providers driving demand. Unified, decoupled data governance is emerging as essential, as governance silos create risks around security, privacy, and efficiency across proliferating platforms. Open standards (e.g., OpenLineage and Open Data Contracts) and governance abstraction are key strategies that enable scalable, cost-effective, and flexible data estates amidst rapid market consolidation.
The Reflexive Supply Chain Stack (7 minute read)
Supply chains suffer from the bullwhip effect (small demand shifts that amplify as they move upstream) because decision points are disconnected from real-time signals. A reflexive stack remedies this by layering Sensing (continuous, granular telemetry across touchpoints), Thinking (real-time analytic engines that diagnose and optimize flows), and Acting (automated control loops that trigger replenishment, routing, or pricing adjustments). This closed-loop architecture cuts decision latency from days to minutes, smooths inventory oscillations, and improves both product freshness and availability.
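The Sensing, Thinking, and Acting layers can be pictured as one small control loop. The sketch below is a toy replenishment example under stated assumptions: the `Signal` fields, the `target_days` coverage rule, and the `place_order` callback are all hypothetical and are not taken from the article.

```python
import math
from dataclasses import dataclass


@dataclass
class Signal:
    """Sensing: one telemetry reading for a product (fields are hypothetical)."""
    sku: str
    on_hand: int
    daily_demand: float


def think(signal, target_days=3.0):
    """Thinking: units needed so stock covers target_days of current demand."""
    target = math.ceil(signal.daily_demand * target_days)
    return max(0, target - signal.on_hand)


def act(signal, place_order):
    """Acting: trigger replenishment only when the computed gap is positive."""
    qty = think(signal)
    if qty > 0:
        place_order(signal.sku, qty)
    return qty
```

Because each order is derived from the latest demand reading rather than from a downstream partner's batched orders, the loop reacts to the actual signal instead of an amplified one.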
Why You Should Use Dev Containers with dbt Fusion (8 minute read)
dbt Fusion is a new Rust-based transformation engine that offers performance and developer experience improvements over dbt Core, but introduces tooling conflicts due to its standalone binary nature. Dev containers offer an isolated, reproducible environment to run dbt Fusion safely alongside existing setups, streamline onboarding, and ensure consistent builds across teams.
DuckLake: Taming the Data Lake with SQL (Sponsor)
Open table formats got you closed off? DuckDB's new DuckLake offers a simpler, faster approach. It's an open data lakehouse format that uniquely uses a standard SQL database for metadata, eliminating the need for complex file-based systems. This means faster queries, simplified management, and ACID transactions across your data lake. More details?
Watch the video or read the blog.
Cost-Effective Logging at Scale: ShareChat's Journey to WarpStream (9 minute read)
ShareChat replaced Kafka with WarpStream, a stateless, zero-ops streaming platform, cutting infrastructure costs by up to 60% and simplifying scaling with Kubernetes integration and S3-backed storage. Advanced compression (ZSTD) and role-based agents boosted throughput and reduced latency, while Spark processing accelerated by 26×, demonstrating significant operational and cost benefits for high-volume log ingestion.
Building an Agentic App with ClickHouse MCP and CopilotKit (12 minute read)
An agentic application using ClickHouse MCP Server and CopilotKit's chat interface transforms natural language prompts into dynamic analytics dashboards. The selected large language model interprets user prompts, queries UK real estate data in ClickHouse, and generates interactive charts.
Pydoll (GitHub Repo)
Pydoll is a Python-based browser automation library that ditches traditional WebDrivers by using the Chrome DevTools Protocol. It features native CAPTCHA bypass (including Cloudflare Turnstile and reCAPTCHA v3), human-like behavior simulation, and full async support, making it ideal for modern scraping and automation on protected websites.
From Ideas to Impact: The GenAI Q&A for Innovation Leaders (2 minute read)
Generative AI (GenAI) empowers innovation leaders to transform ideas into impactful solutions by enhancing creativity, streamlining processes, and personalizing customer experiences. Key strategies include leveraging robust data management for accurate AI outputs, adopting retrieval-augmented generation (RAG) to ground AI in reliable data, and ensuring ethical governance to address biases and privacy concerns.
Coordinated Progress - Part 1: Seeing the System: The Graph (5 minute read)
Modern distributed architectures can all be modeled as graphs, where nodes represent work units and edges represent triggers or data flows. Reliable progress in such systems hinges on making each node's work durable and each edge's delivery dependable, a gap that Durable Execution Engines (DEEs) like Temporal and LittleHorse aim to fill.
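The graph view can be made concrete with a small sketch in which nodes are callables, edges list the downstream nodes each one triggers, and a journal records completed work so that a rerun skips it. This is an illustration only and does not reflect how Temporal or LittleHorse are implemented: `run_graph` is a hypothetical name, the in-memory `journal` set stands in for durable storage, and the breadth-first walk does not wait for all of a node's parents.

```python
from collections import deque


def run_graph(nodes, edges, start, journal):
    """Run work units reachable from start, skipping any already recorded in the journal.

    nodes:   name -> zero-argument callable (the work unit)
    edges:   name -> list of downstream node names it triggers
    journal: set of completed node names (stand-in for durable storage)
    """
    executed = []
    seen = set()
    queue = deque([start])
    while queue:
        name = queue.popleft()
        if name in seen:
            continue
        seen.add(name)
        if name not in journal:
            nodes[name]()
            journal.add(name)  # record completion only after the work succeeds
            executed.append(name)
        queue.extend(edges.get(name, []))
    return executed
```

If a node raises partway through, the journal still holds everything that finished, so calling `run_graph` again with the same journal resumes from the failed node rather than repeating earlier work.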
Curated deep dives, tools and trends in big data, data science and data engineering