TLDR AI 2025-10-28
Claude for Excel 📊, Mercor hits $10B 💰, Qualcomm AI chips 💻
The Coding Personalities of Leading LLMs (Sponsor)
Think the newest LLM is the best for coding? Despite their shared strengths and flaws, each LLM has a unique and inherent style—a measurable “coding personality” that drives their distinct results.
Sonar put GPT-5, Claude Sonnet 4, and Llama 3 models to the test in their latest State of Code report - and found some surprising results. Read it to discover:
- The shared strengths and flaws of LLMs
- Coding archetypes for the leading LLMs
- Hidden quality & security risks of using LLMs for code generation
- How to select the best model for your needs
Get the report ↗️
Advancing Claude for Financial Services (4 minute read)
Anthropic is continuing its push into financial services by launching Claude for Excel, which lets users analyze and modify spreadsheets by chatting with Claude in a sidebar. It is currently in beta behind a waitlist.
The AI Startup Fueling ChatGPT's Expertise Is Now Valued at $10 Billion (5 minute read)
Mercor, a human resources and recruiting startup, has finalized a new funding deal that values the company at $10 billion. The company manages 30,000 contractors around the world who help AI chatbots learn how to think and speak like humans. Venture capital firms are eager to offer funding to founders at sky-high valuations because they fear missing out on the next AI tech giant. Mercor is currently being sued by Scale for allegedly stealing trade secrets.
AI Discovers Novel Cancer Drug, or Did It? (5 minute read)
Google recently published an announcement that its Gemma model had discovered a new potential cancer pathway. It's important to note that the model didn't invent or theorize a method to discover potential drug candidates - the entire construction of the experiment was imagined and executed by humans. The model was used to probabilistically narrow a set of potential candidates to a manageable set that could be inspected by human review. Humans still possess a reasoning capability that is not present in any current AI system.
Speedrunning an RL environment (22 minute read)
RL environments can be surprisingly complex and fun to create. This post explains what RL environments are, introduces the 'verifiers' framework, and also walks readers through how to create an environment for a benchmark called AgentDojo. RL environments are scenarios that LLMs operate in for evaluation or training. Designing them means essentially defining the maze, the rewards, and how the LLM navigates through it.
👨💻
Engineering & Research
The 5-Step Playbook for Finding Your Top AI Use Cases (Sponsor)
Reasoning with Sampling (6 minute read)
A new sampling method enables base models to reach single-shot reasoning performance comparable to reinforcement learning, without retraining or external verifiers, while maintaining diversity and multi-shot accuracy.
On-Policy Distillation (25 minute read)
Thinking Machines Lab demonstrated that training smaller AI models by having them learn from their own mistakes, by grading them with a larger teacher model, achieves the same reasoning performance as RL at 9-30x lower cost. "On-policy distillation" combines the relevance of learning from your own outputs with the dense feedback of traditional distillation, reaching 70% on AIME'24 math problems in just 150 training steps compared to 17,920 GPU hours of RL.
Token-Oriented Object Notation (TOON) (GitHub Repo)
TOON is a compact, human-readable format designed for LLM input that allows developers to pass structured data to LLMs with significantly reduced token usage. It excels at uniform complex objects. Standard JSON is verbose and token-expensive - TOON conveys the same information with fewer tokens. In benchmark tests, TOON achieves higher accuracy compared to JSON while using significantly fewer tokens.
Qualcomm announces new AI chips in data center push, shares surge (2 minute read)
The smartphone chipmaker's new AI200 and AI250 inference accelerators—shipping in 2026 and 2027—sent shares up 20% on the promise of diversification beyond a stagnating mobile market. Saudi Arabia's Humain is committed to purchasing roughly $2 billion worth of the chips starting next year.
MiniMax M2 (3 minute read)
Chinese Lab MiniMax open-sourced M2, a 230B parameter model with 10B active that ranks #1 among open-source models on Artificial Analysis' composite intelligence score. It slightly outperforms the previous generation Claude Sonnet model on coding benchmarks at less than 10% of the cost.
Get the most interesting AI stories and breakthroughs delivered in a free daily email.
Join 920,000 readers for
one daily email