TLDR AI 2025-07-21
OpenAI achieves IMO gold ⛓️, context engineering lessons 🧑💻, LLM accounting 🧾
Can LLMs Do Accounting? (20 minute read)
When tasked with “closing the books” using real SaaS company financials, frontier models excel in the first month but accumulate catastrophic errors quickly. Models either give up entirely (o3 and Gemini) or resort to fabricating transactions and pulling unrelated entries to force reconciliations to pass validation checks, causing misstatements of up to $500,000.
Speeding Up Diffusion Models with torch.compile (19 minute read)
Integrating torch.compile with Hugging Face Diffusers significantly boosts diffusion model performance with minimal code changes. This post outlines strategies for model authors and users to reduce recompilations, leverage full graph compilation, and optimize for hardware constraints.
I built an MCP Server for Observability. This is my Unhyped Take (10 minute read)
MCP servers provide an additional interface between developers and observability platforms. MCP-powered agents are not bringing us closer to automated problem-solving; they are giving us sophisticated hypothesis generators. The unknown remains the domain of the human engineer. AI can brainstorm but can't yet reason, and recognizing this distinction is the key to using AI tools effectively without falling for the hype.
👨💻
Engineering & Research
The Big LLM Architecture Comparison: From DeepSeek-V3 to Kimi K2 (25 minute read)
Seven years after GPT's debut, modern LLMs still share surprisingly similar foundations despite surface-level innovations like Multi-Head Latent Attention and Mixture-of-Experts. However, open-source models reveal clever mathematical optimizations on top of this shared foundation, like DeepSeek's compressed KV caching, Gemma's sliding window attention, and the emerging trend toward sparse MoE designs that activate only subsets of massive parameter counts during inference.
Context Engineering for AI Agents: Lessons from Building Manus (17 minute read)
Context engineering is still an emerging science. While models may be getting better, no amount of raw capability replaces the need for memory, environment, and feedback. Context defines how fast agents run, how well they recover, and how far they scale. This article shares patterns that worked for the team at Manus during development. The lessons were learned through repeated rewrites, dead ends, and real-world testing across millions of users.
Virtual Cell Challenge from Arc Institute (12 minute read)
Arc Institute launched the Virtual Cell Challenge, inviting participants to build models that predict how silencing a gene affects a cell, even in previously unseen cell types.
Inside Windsurf's Weekend Acquisition (3 minute read)
Shortly after Google hired Windsurf's CEO and senior research talent, Cognition's Scott Wu cold-texted the company's newly appointed CEO at 5:30 pm on Friday to pitch an acquisition. Over a 72-hour weekend sprint, they structured a deal combining Windsurf's enterprise sales operation with Cognition's Devin engineering team while ensuring all 250 employees received accelerated vesting and payouts following the sudden collapse of the OpenAI deal.
Meta declines to abide by voluntary EU AI safety guidelines (4 minute read)
Meta has refused to sign the EU's voluntary AI safety guidelines, citing legal uncertainties. The guidelines aim to prevent harmful content from AI models that use large amounts of computing resources. Declining to sign the voluntary Code of Practice might subject Meta to more regulatory scrutiny.