TLDR AI 2025-06-16
Anthropic multi-agent architecture, AMD AI rack analysis, Google leaves Scale
AMD's AI Future is Rack Scale 'Helios' (6 minute read)
AMD's MI400 will rival Nvidia's Blackwell chips with a rack-scale architecture that lets thousands of GPUs function as a unified system. The company claims 40% better tokens per dollar than Nvidia, and its roadmap targets 20x rack-scale energy efficiency by 2030.
Google, Scale AI's Largest Customer, Plans Split After Meta Deal (5 minute read)
Meta's $14 billion acquisition of a 49% stake in Scale AI prompted Google to pull its planned $200 million contract for human-labeled training data over fears of exposing sensitive data to Meta. Microsoft, xAI, and OpenAI are also backing away from Scale AI over the same competitive concerns. The exodus benefits Scale's competitors, with Labelbox expecting "hundreds of millions" in new revenue as AI labs seek neutral data providers or move operations in-house.
Deep Dives & Analysis
Low-Bit Quantization with ParetoQ (19 minute read)
ParetoQ is a new training algorithm that unifies binary, ternary, and 2-to-4 bit quantization, achieving state-of-the-art results across all levels.
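To make the idea concrete, here is a minimal sketch of ternary weight quantization — snapping weights to {-1, 0, +1} with a per-row scale. This is illustrative only; ParetoQ's actual training algorithm is different and more sophisticated.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Quantize a weight matrix to {-1, 0, +1} with a per-row scale.

    Illustrative sketch of ternary quantization in general, not
    ParetoQ's algorithm, which jointly trains across bit widths.
    """
    scale = np.abs(w).mean(axis=1, keepdims=True) + eps  # per-row absmean scale
    q = np.clip(np.round(w / scale), -1, 1)              # snap to {-1, 0, +1}
    return q, scale

w = np.array([[0.9, -0.05, -1.2], [0.3, 0.02, -0.4]])
q, scale = ternary_quantize(w)
w_approx = q * scale  # dequantized approximation of w
```

Each weight is then representable in under 2 bits, at the cost of the approximation error between `w` and `q * scale` — the gap that training-time quantization methods like ParetoQ work to close.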
Have LLMs Finally Mastered Geolocation? (10 minute read)
Open-source intelligence researchers tested 20 AI models on 500 geolocation challenges using unpublished travel photos to ensure models couldn't rely on memorized training data. OpenAI's latest models outperformed Google Lens by cross-referencing architectural styles, vegetation patterns, and partially visible text, while competitors like Claude often managed only continent-level guesses. However, all models still hallucinated. "Deep research" modes paradoxically performed worse than standard versions.
Engineering & Research
Kiss bugs goodbye with fully automated end-to-end test coverage (Sponsor)
QA Wolf's AI-native service gets web and mobile apps to 80% automated test coverage in less than 4 months.
They create and maintain your test suite in open-source Playwright. Plus, they provide unlimited parallel test runs on their infrastructure (24-hour maintenance included).
The result? Salesloft saves $750k/year in QA engineering + executes 300+ tests in parallel on every PR in minutes.
Rated 4.8/5 on G2. Trusted by Cohere, AutoTrader, Mailchimp, and many others.
Schedule a demo to learn more
New Insights for Scaling Laws in Autonomous Driving (4 minute read)
Research from Waymo confirms that, as in language modeling, more data and compute improve autonomous-vehicle performance. This gives researchers and developers firmer grounds to expect that scaling up data quality, data volume, and model size will deliver better driving performance, and it opens the door to more adaptive training strategies for planning tasks in robotics.
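Scaling laws of this kind are typically modeled as a power law, L(N) = a * N^(-b), fit as a straight line in log-log space. The sketch below uses made-up synthetic data purely to illustrate the fitting procedure, not Waymo's actual results.

```python
import numpy as np

# Hypothetical data: model/data scale N vs. planning loss L,
# generated from a known power law so the fit is verifiable.
N = np.array([1e6, 1e7, 1e8, 1e9])
L = 5.0 * N ** -0.2

# A power law L = a * N**(-b) is linear in log-log space:
# log L = log a - b * log N, so an ordinary linear fit recovers it.
slope, log_a = np.polyfit(np.log(N), np.log(L), 1)
b = -slope           # scaling exponent
a = np.exp(log_a)    # prefactor
print(b, a)          # recovers 0.2 and 5.0
```

In practice the interesting question is whether the fitted exponent stays stable as you extrapolate to larger N — that is what "scaling laws hold" means.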
Self-Adapting Language Models (30 minute read)
A new training approach enables LLMs to generate "self-edits" that produce persistent weight updates through supervised fine-tuning. The framework outperformed GPT-4.1 despite using a smaller model, but suffered from catastrophic forgetting and required 15x more tokens than standard inference. This addresses the looming data wall and limitations on personalization and memory by enabling models to bootstrap their own improvement through self-generated training material rather than relying on external human-generated text.
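The loop described above can be sketched schematically. This is a toy stand-in, not the paper's framework: `generate_self_edit` and `finetune` are hypothetical placeholders for model sampling and a real supervised fine-tuning step.

```python
# Hypothetical sketch of a self-edit loop: the model writes its own
# training examples, then a fine-tuning step makes them persistent.

def generate_self_edit(model, context):
    # Stand-in for the model generating its own training data;
    # here we fake it by restating the context as an input/target pair.
    return [{"input": context, "target": f"summary of: {context}"}]

def finetune(model, examples):
    # Stand-in for an SFT step that bakes the edit into the weights;
    # here we just append to a memory list to show the persistence.
    model["memory"].extend(examples)
    return model

model = {"memory": []}
for ctx in ["doc A", "doc B"]:
    edits = generate_self_edit(model, ctx)
    model = finetune(model, edits)

print(len(model["memory"]))  # 2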
How Anthropic Built Their Deep Research System (15 minute read)
In this engineering blog post, Anthropic details its findings on prompt design, tool coordination, and production-reliability challenges in orchestrating multi-agent systems. The design uses an orchestrator-worker pattern in which a lead agent spawns specialized sub-agents that search in parallel, far outperforming a single-agent Opus baseline. Token usage alone explains 80% of performance variance, with multi-agent systems consuming 15x more tokens than a regular chat while enabling far more complex research tasks.
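The orchestrator-worker shape can be sketched with plain thread-based fan-out. The stubs below are made up; Anthropic's actual system spawns LLM sub-agents with their own prompts and tools, not thread workers.

```python
# Minimal sketch of the orchestrator-worker pattern: a lead agent
# decomposes a task, fans out sub-agents in parallel, then collects
# their findings for synthesis. Sub-agent logic is a stand-in.
from concurrent.futures import ThreadPoolExecutor

def sub_agent(subtask):
    # Stand-in for a specialized sub-agent running its own searches.
    return f"findings for {subtask!r}"

def lead_agent(task):
    # The lead agent plans subtasks, then runs sub-agents in parallel.
    subtasks = [f"{task} / angle {i}" for i in range(3)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(sub_agent, subtasks))

results = lead_agent("compare GPU rack architectures")
print(len(results))  # 3
```

The parallel fan-out is where both the speedup and the 15x token cost come from: each sub-agent carries its own full context.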
The AI Eval Flywheel: Scorers, Datasets, Production Usage, & Rapid Iteration (10 minute read)
The eval frameworks discussed at the 2025 AI Engineer World's Fair were surprisingly consistent. Most involve structuring inputs and output evaluations, then evolving both based on real production usage. Since faster iteration yields a better experience, teams try to make their eval flywheels as quick and frictionless as possible. One key idea is to have 'playgrounds' that make it easy to tweak a feature and run it against datasets and evals.
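A minimal "playground" in this sense is just a dataset, a scorer, and a run loop you can re-execute after each tweak. The feature under test here is a trivial stand-in, purely to show the flywheel shape.

```python
# Minimal eval playground: tweak `feature`, re-run, watch the score.

def feature(text):
    return text.lower()  # the thing being iterated on (stand-in)

def scorer(output, expected):
    return 1.0 if output == expected else 0.0

dataset = [  # ideally harvested from real production traffic
    {"input": "Hello", "expected": "hello"},
    {"input": "WORLD", "expected": "world"},
]

score = sum(
    scorer(feature(case["input"]), case["expected"]) for case in dataset
) / len(dataset)
print(score)  # 1.0
```

The flywheel is the loop around this script: production traffic feeds new cases into `dataset`, failures sharpen `scorer`, and each iteration on `feature` gets an immediate, comparable number.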
Get the most interesting AI stories and breakthroughs delivered in a free daily email.
Join 920,000 readers for one daily email