TLDR AI 2025-12-22
ChatGPT tone control 🤖, gaming the METR plot 📈, Anthropic Bloom 🌸
Retrofitting legacy systems for AI (Sponsor)
Many financial systems of record (like ERP) were built at a time when AI tools were the stuff of science fiction. Now, they need to work together.
On the latest episode of Unpack Pricing, Metronome CEO Scott Woody talks with Rillet CEO Nicolas Kopp about why old financial systems often have a "garbage in, garbage out" problem for modern companies running AI.
They get into why gen AI is famously bad with numbers (and where it actually helps), why 20 years of finance experience might now be a hiring red flag, and how one customer went from six-month pricing changes to two-hour launches.
Listen now →
ChatGPT Adds Tone Personalization (1 minute read)
OpenAI has introduced new personalization options in ChatGPT, letting users adjust enthusiasm, warmth, and emoji use directly. These controls, available in the Personalization menu, offer "More," "Less," or "Default" settings, expanding tone customization beyond the existing base style and tone feature.
Cursor Acquires Graphite (2 minute read)
Cursor has acquired Graphite, a company known for its code review platform built around stacked pull requests. This marks Cursor's third acquisition as it aims to build a comprehensive AI-powered dev platform.
Introducing Bloom: an open source tool for automated behavioral evaluations (7 minute read)
Anthropic's Bloom is an open-source tool for generating automated behavioral evaluations of AI models. Bloom assesses specific behaviors like self-preferential bias and sabotage by creating scenarios and quantifying behavior occurrence across models. It efficiently differentiates between aligned and misaligned models and correlates strongly with human judgment, enabling scalable and reliable behavior evaluations.
Experiment Diary (3 minute read)
This document is a diary of an experiment that uses GRPO to teach an LLM to generate regex from a description. It details the performance, learnings, modifications, and key takeaways from each experiment. The initial training run was on December 17. The model quickly learned to generate valid regex tags, but it was basically generating random regex strings.
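To make the setup concrete, here is a minimal sketch of the two pieces such an experiment needs: a reward for a generated regex and the group-relative advantage that GRPO computes from a batch of sampled completions. The diary's actual reward is not described here, so `regex_reward`, its test strings, and the candidate patterns are all hypothetical illustrations.

```python
import re
from statistics import mean, pstdev


def regex_reward(pattern, should_match, should_not_match):
    """Hypothetical reward: 0.0 for an invalid regex, otherwise the
    fraction of example strings the pattern classifies correctly."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        return 0.0
    hits = sum(1 for s in should_match if compiled.fullmatch(s))
    rejects = sum(1 for s in should_not_match if not compiled.fullmatch(s))
    total = len(should_match) + len(should_not_match)
    return (hits + rejects) / total if total else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward against the mean
    and standard deviation of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for the description "one or more digits".
candidates = [r"\d+", r"[0-9]*", r"(\d", r"[a-z]+"]
rewards = [regex_reward(p, ["7", "2025"], ["", "abc"]) for p in candidates]
advantages = group_relative_advantages(rewards)
```

In this sketch, a reward that also scores behavior on example strings separates a correct pattern from one that merely compiles.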
The changing drivers of LLM adoption (15 minute read)
LLM use is rising. People are increasingly using different LLMs and different products, and using them in different places. ChatGPT remains dominant and keeps acquiring new users, but Gemini's growth has been faster over the last few months. OpenAI's revenue seems to be on track, but consumer revenue is likely decreasing as a share of the total. A substantial share of workplace AI use involves workers adopting tools on their own rather than waiting for employer-provided access.
Understanding AI Benchmarks (25 minute read)
Benchmarks are the most widely misunderstood part of the AI ecosystem. The narrative keeps implying a universal increase in intelligence, but the numbers can be misleading. To navigate this noise, look at the aggregate, look at the relative, and verify with your own tasks. The only benchmark that matters at the end of the day is your own workload.
Andrej Karpathy's 2025 LLM Year in Review (6 minute read)
Andrej Karpathy has outlined paradigm shifts of LLMs in 2025, including fast inference engines, model distillation trends, real-time agents, neural GPUs, and the rise of high-quality open models like DeepSeek-V2 and RWKV.
Evaluating Context Compression for AI Agents (10 minute read)
What happens when agents run out of memory determines whether they continue productively or have to start from scratch. This post explores an evaluation framework that measures how much context different compression strategies preserve. Structured summarization retains more useful information than alternative methods without sacrificing compression efficiency.
👨‍💻
Engineering & Research
Could public AI tools be leaking your sensitive data? (Sponsor)
One in three employees uses AI tools without approval, risking data leaks and compliance violations. With this Enterprise AI Governance Kit from You.com, you will get ready-to-use templates to help protect your organization, including security checklists, usage policies, and governance frameworks.
Download the kit.
Introducing MiMo-V2-Flash (10 minute read)
MiMo-V2-Flash is a powerful, efficient, and ultra-fast foundational language model that excels in reasoning, coding, and agentic scenarios. It serves as an excellent general-purpose assistant for everyday tasks. The model is available globally on Hugging Face, AI Studio, and Xiaomi's API platform. Benchmark results are available in the article.
jax-js (GitHub Repo)
jax-js is a machine learning framework for the browser. It brings JAX-style, high-performance CPU and GPU kernels to JavaScript, so users can run numerical applications on the web. The library is written from scratch and has no external dependencies. It can run anywhere a browser can run.
tcgen05 for dummies (70 minute read)
tcgen05 is the set of PTX instructions that programs Tensor Cores on the latest NVIDIA Blackwell GPUs. This post contains a tutorial for Blackwell in plain CUDA C++ with PTX. It documents the author's process of learning tcgen05 and reaching 98% of cuBLAS speed. Readers can follow the tutorial using Modal or any other B200 cloud provider.
Multiplexing MCP Servers For Agentic Specialization (8 minute read)
MCP servers give agents the tools they need to accomplish tasks. This post discusses how to multiplex MCP servers to simplify the connection to various tools within them. Multiplexing allows multiple MCP servers to be used over a gateway in a single interaction. It allows agents to access multiple MCP servers with different stacks, clouds, applications, and frameworks for specialized tasks.
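As a rough illustration of the multiplexing idea, the sketch below routes namespaced tool calls through a single gateway object. It does not use any real MCP SDK: `ToolServer`, `Gateway`, and the `server.tool` naming scheme are hypothetical stand-ins chosen only to show the routing pattern.

```python
from typing import Callable, Dict, List

Tool = Callable[..., object]


class ToolServer:
    """Hypothetical stand-in for one MCP server: a name plus its tools."""

    def __init__(self, name: str, tools: Dict[str, Tool]):
        self.name = name
        self.tools = tools


class Gateway:
    """Exposes tools from many servers under names like 'server.tool'."""

    def __init__(self) -> None:
        self._servers: Dict[str, ToolServer] = {}

    def register(self, server: ToolServer) -> None:
        if server.name in self._servers:
            raise ValueError(f"duplicate server name: {server.name}")
        self._servers[server.name] = server

    def list_tools(self) -> List[str]:
        # One flat, namespaced catalog for the agent to choose from.
        return sorted(
            f"{server.name}.{tool}"
            for server in self._servers.values()
            for tool in server.tools
        )

    def call(self, qualified_name: str, **kwargs):
        # Split 'server.tool' and forward the call to the owning server.
        server_name, sep, tool_name = qualified_name.partition(".")
        server = self._servers.get(server_name)
        if not sep or server is None or tool_name not in server.tools:
            raise KeyError(f"unknown tool: {qualified_name}")
        return server.tools[tool_name](**kwargs)


gateway = Gateway()
gateway.register(ToolServer("math", {"add": lambda a, b: a + b}))
gateway.register(ToolServer("text", {"upper": lambda s: s.upper()}))
```

Calling `gateway.call("math.add", a=2, b=3)` reaches the `math` server while the agent holds only one connection, which is the simplification multiplexing is after.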
Qwen-Image-Layered (GitHub Repo)
Qwen-Image-Layered is a model capable of decomposing an image into multiple RGBA layers. Each layer can be independently manipulated without affecting other content. They can be resized, repositioned, and recolored. The approach enables high-fidelity and consistent editing.
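The sketch below illustrates why an RGBA layer stack allows independent edits: one layer is recolored and the image is flattened again with the standard "over" operator. It says nothing about how the model itself produces the layers, and the tiny 1x2 image and the `recolor` helper are hypothetical examples.

```python
def over(top, bottom):
    """Porter-Duff 'over' for straight-alpha RGBA pixels in [0, 1]."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t, b):
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def composite(layers):
    """Flatten equally sized layers, listed bottom to top, into one image."""
    height, width = len(layers[0]), len(layers[0][0])
    result = [[(0.0, 0.0, 0.0, 0.0)] * width for _ in range(height)]
    for layer in layers:
        for y in range(height):
            for x in range(width):
                result[y][x] = over(layer[y][x], result[y][x])
    return result


def recolor(layer, rgb):
    """Copy a layer, changing the color of every non-transparent pixel."""
    return [
        [(*rgb, a) if a > 0 else (r, g, b, a) for (r, g, b, a) in row]
        for row in layer
    ]


CLEAR = (0.0, 0.0, 0.0, 0.0)
background = [[(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0)]]  # opaque blue
sticker = [[(1.0, 0.0, 0.0, 1.0), CLEAR]]  # red on the left pixel only

before = composite([background, sticker])
after = composite([background, recolor(sticker, (0.0, 1.0, 0.0))])
```

Only the pixel covered by the edited layer changes between `before` and `after`. The background pixel is identical in both.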
How to game the METR plot (9 minute read)
METR topics are public, making it easy for a frontier lab to game METR horizon length measurements. Under METR's assumptions, the horizon length might add little information beyond benchmark accuracy. A meme is going around based on a team achieving a one-to-four-hour range on the METR plot. This post explains why the plot has been interpreted incorrectly.
The Shape of AI: Jaggedness, Bottlenecks, and Salients (11 minute read)
AI is incredibly good at some tasks while being really bad at others. This is the 'Jagged Frontier' of AI ability. The jaggedness is likely going to remain a big part of AIs going forward. However, the growing frontier will outpace jaggedness.