TLDR AI 2025-09-26
ChatGPT Pulse, Perplexity Search API, GDPval
ChatGPT Pulse Preview (6 minute read)
ChatGPT Pulse is a new feature for Pro users on mobile that proactively delivers personalized research and updates via visual cards based on user chats, preferences, and connected apps.
Musk's xAI sues OpenAI, alleging theft of trade secrets (2 minute read)
xAI alleges OpenAI conducted a "coordinated, unfair, and unlawful campaign" to recruit key employees and induce them to bring trade secrets, including xAI's rapid data center deployment methods.
Perplexity Search API (2 minute read)
Perplexity has released a new Search API that gives developers access to the same large-scale infrastructure behind its public answer engine, enabling high-quality retrieval over hundreds of billions of web pages.
Deep Dives & Analysis
What are popular AI coding benchmarks actually measuring? (18 minute read)
Popular coding benchmarks measure something narrower than what their names suggest. Claude scoring 80% on SWE-bench doesn't translate to it one-shotting 80% of the tasks thrown at it. This post looks at SWE-bench Verified, SWE-bench Pro, Aider Polyglot, LiveCodeBench, and other benchmarks to see what they are actually measuring. Designing a good benchmark is highly labor-intensive. Without human review and annotations, it's nearly impossible to make a good benchmark.
Toward Computational Taste: LLMs, Aesthetics & Judgment (9 minute read)
LLMs are being used to model and optimize taste in diverse fields through personalized systems like Taste Engines, Aesthetic LLMs, and Taste Tribes. Methods like LoRe and models like TAPO enable LLMs to adapt to individual aesthetic preferences, potentially revolutionizing recommendation engines and social networks. This shift towards computational taste signifies a new era where machines not only reflect but also shape human preferences, offering both cultural and economic opportunities.
Is OpenAI's Reinforcement Fine-Tuning (RFT) Worth It? (32 minute read)
OpenAI's reinforcement fine-tuning (RFT) for o4-mini is meant to train models to get better at specific tasks using reinforcement learning. It costs up to 700 times more than supervised fine-tuning but only seems to deliver clear wins on agentic coding tasks. Configurable graders give engineers latitude in reward design, and the performance gains can be substantial when it works, but the high cost and limited applicability make its use difficult to justify.
Engineering & Research
The attack surface of on-device AI (Sponsor)
GDPval: Real-World AI Benchmarking (11 minute read)
OpenAI's GDPval is an evaluation benchmark that tests model performance on real-world, economically valuable tasks across 44 occupations.
Continuing to bring you our latest models, with an improved Gemini 2.5 Flash and Flash-Lite release (4 minute read)
Google DeepMind has released updated versions of Gemini 2.5 Flash and 2.5 Flash-Lite. The models, available on Google AI Studio and Vertex AI, feature improvements in quality and speed compared to the current stable models. They are also more efficient, with a 50% reduction in output tokens for Gemini 2.5 Flash-Lite and a 24% reduction for Gemini 2.5 Flash. The models are not intended to graduate to a new, stable version, but will help shape future stable releases.
Gemini Robotics 1.5 brings AI agents into the physical world (15 minute read)
Google DeepMind has released two models that unlock agentic experiences with advanced thinking. Gemini Robotics 1.5 is a vision-language-action model that turns visual information and instructions into motor commands for a robot to perform a task. The model thinks before acting, shows its reasoning process, and learns across embodiments. Gemini Robotics-ER 1.5 is a vision-language model that reasons about the physical world, natively calls digital tools, and creates detailed multi-step plans to complete a mission. These advances will help developers build more capable and versatile robots that can actively understand their environments and complete complex multi-step tasks in a general way.
Becoming a Research Engineer at a Big LLM Lab - 18 Months of Strategic Job Hunting (41 minute read)
Max Mynter signed on as a research engineer with Mistral earlier this week after years of working strategically toward that outcome. This account of Mynter's personal experiences documents how he navigated his career to get where he wanted. Over those years, he worked on gaining an information advantage and then acted on that information so he would be prepared when it mattered. He took the time to learn skills that can't be faked and developed a network of peers and friends who sent opportunities his way.
AI isn't replacing radiologists (21 minute read)
Improvements in image interpretation have run far ahead of their adoption. There are hundreds of models that could potentially help radiologists, but AI is often limited to assistive use on a small subset of scans in any given practice. Benchmarks alone overstate the promise of AI in radiology. Models can lift productivity, but their implementation depends on behavior, institutions, and incentives. So far, radiologists have only become busier as machines have improved.
Build and test AI agents without leaving your IDE (Sponsor)
Amazon's Nova Act extension brings AI agent development directly into IDEs like Cursor, VS Code, and Kiro. Generate scripts via chat, test cell-by-cell, and debug with live logs, cutting development time from days to minutes.
In Defense of AI Evals, for Everyone (7 minute read)
Evals aren't about following a rigid philosophy - sometimes it's fine to be less rigorous, and sometimes the task demands more rigor.
CoreWeave expands OpenAI partnership with $6.5B deal, bringing total to $22.4B (3 minute read)
This is CoreWeave's third expansion with OpenAI this year.
OpenAI launches shared projects for ChatGPT Business (1 minute read)
OpenAI has launched shared projects, smarter connectors, and new compliance controls for ChatGPT Business.
Spotify to label AI music, filter spam, and more in AI policy change (3 minute read)
Spotify is adopting a new industry standard called DDEX that identifies how AI was used in a song's production.