TLDR AI 2025-08-12
OpenAI beats IOI 🏆, Claude memory 🧠, Nvidia’s physical AI 🤖
What makes AI agents successful? Query classes, LLM judges, and more (Sponsor)
Glean's AI agents do serious work at Booking.com, Duolingo, Databricks, and other major enterprises. Download this technical whitepaper for an inside look at the technical differentiators that distinguish agents built on Glean's Work AI platform, including:
- Personalization: Tailor the agent experience by searching across a library of “golden” workflows.
- Query classes: Design common enterprise query types — like data analysis (“find all Jira tickets assigned to me”), progress summarization (“what have I worked on this week?”), and request handling (“request a new laptop”).
- LLM judges: Evaluate agent performance at scale using task-specific LLM judges and real-world eval sets for each query class.
📥 Get the whitepaper
OpenAI's reasoning system wins gold at International Olympiad in Informatics (3 minute read)
OpenAI's reasoning models placed first among AI participants and outperformed 325 of 330 human competitors at the prestigious programming olympiad. The system operated under the same constraints as the human contestants and used general-purpose models without competition-specific training. It jumped from last year's near-bronze to 98th percentile performance.
NVIDIA's Push on Physical AI (7 minute read)
NVIDIA unveiled innovations in neural rendering, 3D generation, simulation, and reasoning models for physical AI at SIGGRAPH 2025. Highlights included new Omniverse NuRec libraries for large-scale reconstruction, Cosmos Reason VLM for physics-aware reasoning, and updates to the Metropolis vision AI platform.
Anthropic's Claude chatbot can now remember your past conversations (2 minute read)
Anthropic released a memory function for its Claude chatbot that allows users to recall past conversations on demand. The feature supports web, desktop, and mobile, and is currently available to Max, Team, and Enterprise subscribers. Unlike OpenAI's ChatGPT, Claude doesn't build a user profile and retrieves past chats only when requested.
GPT-5s Are Alive: Basic Facts, Benchmarks, and the Model Card (30 minute read)
OpenAI released several good models at once: GPT-5, GPT-5-Thinking, GPT-5-With-The-Router, GPT-5-Pro, and GPT-5-API. These models cut down on errors and hallucinations and are easier to use. This post provides an introduction to these models, including basic facts, benchmarks, and the model card. GPT-5-Thinking and GPT-5-Pro look like substantial upgrades to o3 and o3-Pro.
👨‍💻 Engineering & Research
Your AI tools work great in isolation. That's the problem. (Sponsor)
Companies have LLM subscriptions, automated workflows, and AI agents, but they're all running in silos. The real competitive advantage isn't having AI tools, it's orchestrating them into cohesive processes. Camunda's ultimate guide will help you turn disconnected AI experiments into integrated operations.
Get your copy
LangDiff (GitHub Repo)
Modern AI applications increasingly rely on LLMs to generate structured data, but streaming these outputs poses unique challenges. LangDiff is a Python library that helps stream structured LLM outputs to frontends. It can be used to build responsive AI applications where the backend structures and frontend experiences can evolve independently.
Physically Controllable Relighting of Photographs (2 minute read)
This study presents a self-supervised approach to in-the-wild image relighting that enables fully controllable, physically based illumination editing. The approach combines the physical accuracy of traditional rendering with the photorealistic appearance made possible by neural rendering. It represents a significant step in bringing the explicit physical control over lights available in typical 3D computer graphics tools to in-the-wild relighting. A 5-minute-long video explaining the study with examples is available.
An essential primer for building enterprise-ready MCP servers (10 minute read)
MCP doesn't require compromising security. It lets security be applied intelligently so that teams can embrace AI safely, confidently, and without unnecessary barriers. Overly restrictive access slows innovation, frustrates developers, and prevents teams from realizing the full potential of MCP-based integrations. Organizations should already have guardrails proven to protect critical systems, and they should have faith in those foundations.
Forward Deployed AI Research (2 minute read)
Forward Deployed Research is the discipline of living with a problem, building just enough to learn, throwing it into the wild, and pulling the feedback directly into the next version. It is where the real progress in AI is happening now: the people pushing the boundaries refuse to accept the traditional divide between research and deployment. AI may cause the distinction between scientist and engineer to blur.
Elon Musk's xAI Releases Grok 4 For Free Globally, Challenges OpenAI's GPT-5 Launch (2 minute read)
xAI's Grok 4 is now freely accessible to all users worldwide. Free users have generous usage limits for a limited time. Grok 4 features an Auto Mode, where the AI decides if a user prompt requires deeper reasoning or a simple response, and an Expert Mode, which allows users to manually trigger more in-depth answers. Access to Grok 4 Heavy remains exclusive to SuperGrok Heavy subscribers.
How Benchmaxxed is gpt-oss-120b? (5 minute read)
An analysis of currently available cutting-edge open source models found that gpt-oss-120b scored the worst on LiveBench, a private benchmark that uses secret questions. While the model looked good on release day, a number of critical reports have since been published. This isn't conclusive evidence that OpenAI is gaming benchmarks, but it raises the question of whether it is worth creating meta-benchmarks that measure overfitting more robustly.