TLDR AI 2026-04-16
Gemini 3.1 Flash TTS 🎙️, Agent-to-Person marketplace 🤝, OpenAI Agents SDK 🛠️
Gemini 3.1 Flash TTS: the next generation of expressive AI speech (4 minute read)
Google's Gemini 3.1 Flash TTS enhances text-to-speech with improved expressivity and controllability, featuring a notable Elo score of 1,211 on the Artificial Analysis TTS leaderboard. The model supports over 70 languages and introduces audio tags for granular control of vocal style, allowing easy manipulation via natural language commands. All generated audio is watermarked with SynthID to ensure authentic content, preventing misinformation.
Humwork A2P marketplace connects AI agents with experts (2 minute read)
Humwork launches the first Agent-to-Person (A2P) marketplace to connect AI agents with verified human experts when AI tools encounter challenges. The platform integrates with AI-centric tools like Claude Code and Replit, allowing handoffs to occur in under 30 seconds with full session context shared securely. With more than 1,000 experts available globally, Humwork boasts an 87% resolution rate and is backed by Y Combinator's P26 batch.
OpenAI's Updated Agents SDK (4 minute read)
OpenAI introduced updates to its Agents SDK, adding a model-native harness for cross-file and tool workflows along with sandboxed execution for safer task handling.
Evaluating agents for scientific discovery (7 minute read)
Many teams are claiming extraordinary things about their agents. The evidence behind these claims is usually disappointing. ScienceWorld and DiscoveryWorld are benchmarks developed to test whether AI agents can actually do science. ScienceWorld asks whether agents can 're-make' classic scientific discoveries at roughly an elementary school level, while DiscoveryWorld tests open-ended discovery at a college or PhD level. These benchmarks, open and freely available, help test what science agents are actually capable of.
Rethinking AI TCO: Why Cost per Token Is the Only Metric That Matters (6 minute read)
Cost per token is crucial for AI infrastructure TCO assessment due to its focus on delivered intelligence, integrating hardware, software, and utilization efficiencies. Unlike traditional metrics like compute cost or FLOPS per dollar, cost per token highlights real-world performance, enabling profitable AI scaling. Evidence from NVIDIA shows its Blackwell platform drastically reduces cost per token compared to Hopper, offering significant business value.
Evaluating Agent Reasoning (28 minute read)
IBM Research uses an executable benchmark with thousands of APIs and documents to test multi-step agent reasoning and tool use, revealing consistent performance gaps and common failure modes.
Why do dLLMs tend to collapse in RL (3 minute read)
Diffusion Language Models (dLLMs) experience training collapse during Reinforcement Learning because their log-likelihood must be estimated using high-variance Monte Carlo sampling, which creates noisy importance ratios. These noisy ratios induce gradient spikes that push policy drift in a positive feedback loop, a problem that traditional AR methods like conditional clipping fail to solve. The newly proposed StableDRL framework stabilizes the update process by combining unconditional clipping to suppress extreme values with self-normalization tied to the effective information in the batch.
👨💻
Engineering & Research
Run AI agents without exposing your infrastructure (Sponsor)
You wouldn't let an unknown human into your infrastructure. Why let an unknown agent?
Teleport Beams runs each agent in an isolated Firecracker VM with built-in identity — connected to your infrastructure and inference services with no secrets and no IAM wrestling. Zero standing privileges. Fully auditable.
Get early access.
Parcae: Doing more with fewer parameters using stable looped models (6 minute read)
Parcae is one of the first stable architectures for looped language models. It achieves the quality of a Transformer twice the size with clean, predictable training. Parcae increases the recurrence rather than purely scaling data, creating a new medium to scale quality. The name Parcae is a homage to the three Roman fates: Nona, Decima, and Morta.
NVIDIA's Lyra 2 (32 minute read)
Lyra 2.0 is a framework for generating long, camera-controlled videos that maintain 3D consistency, using geometry-guided retrieval to prevent spatial forgetting and self-augmented training to reduce temporal drift.
Many-Tier Instruction Hierarchy in LLM Agents (1 minute read)
Researchers propose a Many-Tier Instruction Hierarchy (ManyIH) to address instruction conflicts in LLM agents, surpassing traditional models with fixed privilege levels. They introduce ManyIH-Bench, assessing models across 12 privilege levels and 853 agent tasks, finding current models perform poorly at 40% accuracy. This highlights the need for scalable conflict resolution in complex agentic environments.
Jensen Huang – TPU competition, why we should sell chips to China, & Nvidia's supply chain moat (90 minute read)
This post features a transcript of an interview with Jensen Huang. He discusses TPU competition, Nvidia's lock on the supply chain needed to make advanced chips, whether the US should sell AI chips to China, why Nvidia isn't a hyperscaler, how the company makes its investments, and more. Links to audio and video of the interview are available.
Claude probably wasn't secretly nerfed. Anthropic made the black box too dark (10 minute read)
Users have accused Anthropic of nerfing Claude Code, but there's no evidence that Anthropic has done this. The strongest public reports still lack independent raw data. However, Anthropic didn't need to nerf Claude for Claude Code to become a different product. Effort defaults, adaptive thinking, cache duration, context compaction, quota policy, and status incidents can all change the experience while the model name stays the same.
Get the most interesting AI stories and breakthroughs delivered in a free daily email.
Join 920,000 readers for
one daily email