TLDR AI 2026-05-18
Gemini Extended Thinking, ChatGPT finance, Claude Code at scale
ChatGPT Personal Finance (6 minute read)
OpenAI released a preview of a new personal finance experience in ChatGPT for Pro users in the US. The feature lets users securely connect financial accounts, view spending dashboards, and ask questions grounded in their financial context and goals.
Gemini app rolling out 'Extended' thinking level, new 3rd-party app integrations (3 minute read)
Google is rolling out a new 'Thinking level' option for Gemini. The option has appeared for some users when they select Fast or Gemini 3.1 Pro. Google is also preparing to add more integrations with third-party apps in Gemini. Support for Canva, Instacart, and OpenTable appears to be coming.
Codex will soon be able to control other desktop devices via Computer Use (2 minute read)
OpenAI is working on a capability that lets its coding agent operate macOS applications through Computer Use even when a laptop is locked or asleep. Computer Use currently requires an unlocked, awake session to see the screen, move the cursor, and type. Lifting the restriction will allow users to direct their agents without having to walk back to their machines to log in first. It is unknown when the feature will be released.
Deep Dives & Analysis
AI economics part 2 (11 minute read)
AI labs are in an ongoing war over GPU resources. This article looks into demand and supply and how the infrastructure powering AI today may not be sufficient. Scaling GPUs doesn't scale compute linearly. Efficiency matters more than raw scale given finite supply.
Portability Is a Myth: Why the Best AI Stacks Will Never Be Hardware-Agnostic (15 minute read)
AI kernel portability is structurally impossible because TPU's Pallas, NVIDIA's CuTile and CUTLASS, AWS's NKI, AMD's FlyDSL, and Tenstorrent's tt-Metalium each expose hardware-specific concepts that no universal DSL can unify. The evidence: MaxText's MoE grouped matmul ships as 282 lines of Pallas on TPU while flashinfer's equivalent for Blackwell SM100 takes 4 million lines of generated CUDA, with zero shared code because the algorithms themselves diverge across hardware.
Tokenomics: the 62.5-minute rule for Claude's cache (8 minute read)
If you expect to need a cache again within 62.5 minutes, refresh it. Otherwise, let it expire. This threshold is the same across models and does not depend on the size of the cache. The dollar amounts may change, but the decision point stays the same.
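One way to arrive at a 62.5-minute break-even is sketched below. It rests on assumptions the summary does not state: a cache write is assumed to cost 1.25x the base input price, and a cache read is assumed to cost 0.1x and to reset a 5-minute time-to-live. Under those assumed numbers, 12.5 refreshes cost the same as one fresh write, and 12.5 x 5 minutes is 62.5 minutes. All function and constant names are hypothetical.

```python
# Assumed pricing multipliers relative to the base input-token price.
# These values are NOT given in the summary; they are illustrative assumptions.
CACHE_WRITE_MULTIPLIER = 1.25  # assumed cost of writing the cache from scratch
CACHE_READ_MULTIPLIER = 0.10   # assumed cost of a read that refreshes the cache
CACHE_TTL_MINUTES = 5.0        # assumed time-to-live reset by each read


def break_even_minutes(write=CACHE_WRITE_MULTIPLIER,
                       read=CACHE_READ_MULTIPLIER,
                       ttl=CACHE_TTL_MINUTES):
    """Idle time at which keeping the cache warm costs as much as rewriting it."""
    refreshes_per_rewrite = write / read
    return refreshes_per_rewrite * ttl


def keep_warm_cost(gap_minutes, tokens, price_per_token):
    """Cost of one refresh per TTL window for the whole idle gap (continuous approximation)."""
    refreshes = gap_minutes / CACHE_TTL_MINUTES
    return refreshes * CACHE_READ_MULTIPLIER * tokens * price_per_token


def rewrite_cost(tokens, price_per_token):
    """Cost of letting the cache expire and writing it again."""
    return CACHE_WRITE_MULTIPLIER * tokens * price_per_token


def should_refresh(expected_gap_minutes):
    """True when the next use is expected before the break-even point."""
    return expected_gap_minutes < break_even_minutes()
```

Token count and price multiply both sides of the comparison, so they cancel out. That is why, in this sketch, the decision point does not move with cache size or model price.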
Engineering & Research
May 26 workshop: Agent orchestration on AWS (Sponsor)
Multi-agent AI systems fail when agents can't share state, coordinate approvals, or recover from failures. The root cause: no orchestration layer managing execution and approval gates.
Build that layer using AWS Step Functions, Amazon Bedrock Agents, and Apache Airflow. See demos of retry logic, human approvals, and graceful failure handling in the May 26 workshop.
How Claude Code works in large codebases: Best practices and where to start (5 minute read)
Claude Code is now being used in production across multiple large codebases in organizations with thousands of developers. These environments bring challenges that smaller codebases don't. This article covers patterns that Anthropic has seen lead to successful adoption of Claude Code at scale. It looks at how Claude Code has been used in monorepos with millions of lines, legacy systems built over decades, and microservices across separate repositories.
Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention (33 minute read)
KV-cache size, memory traffic, and attention cost quickly become the main constraints as reasoning models and agent workflows keep more tokens around for longer. LLM developers are adding a growing number of architecture tricks to reduce costs. Most of the changes look like small tweaks, but some are quite intricate design changes. This article looks at these architecture changes with a focus on what changes inside the transformer block, residual stream, KV cache, and attention computation.
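To give a rough sense of why KV-cache size becomes a constraint, here is a minimal sketch of the usual back-of-the-envelope estimate: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model dimensions in the usage note are hypothetical and are not taken from the article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_element=2, batch_size=1):
    """Approximate KV-cache size in bytes for a standard transformer decoder.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_element=2 assumes 16-bit storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_element * batch_size)


def to_gib(num_bytes):
    """Convert bytes to GiB."""
    return num_bytes / (1024 ** 3)
```

For a hypothetical 32-layer model with 32 KV heads of dimension 128 at a 131,072-token context, `to_gib(kv_cache_bytes(32, 32, 128, 131072))` comes to 64 GiB per sequence. Sharing keys and values so that only 8 KV heads are stored brings it to 16 GiB. The estimate grows linearly with context length, which is why longer-lived contexts put pressure on memory.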
Lighthouse Attention (11 minute read)
Lighthouse Attention, a selection-based hierarchical attention, offers up to 17x faster forward and backward passes than standard attention models at large contexts. It utilizes FlashAttention on a dense sub-sequence, maintaining efficiency and compatibility with upstream improvements. By enabling efficient long-context training and retaining dense model competence, Lighthouse Attention achieves 1.4x to 1.7x speedup in pretraining while reducing computational costs.
Notes on pretraining parallelisms and failed training runs (12 minute read)
Pretraining runs often fail. This article looks at all the ways that things can go wrong and why training is such a precarious operation. The key culprits seem to be breaking causality and adding bias.
The haves and have nots of the AI gold rush (1 minute read)
The AI boom has created a wealth divide, with an estimated 10,000 individuals from companies like OpenAI and Nvidia achieving over $20M in wealth, while others face uncertain futures with stagnant job prospects and layoffs. Software engineers express concerns about their skills becoming obsolete, raising anxiety about career paths. This disparity fuels tension in San Francisco's tech scene as some criticize the dual role of AI as a wealth source and a career threat.
Runway started by helping filmmakers; now it wants to beat Google at AI (11 minute read)
Runway's founders believe that the next form of AI will be built from video and world models that learn how the world works. The company is training models directly on observational data to reach the next frontier of AI. Runway was one of the first to develop AI video generation, but world models are a different race with deep-pocketed competitors. The company has raised $860 million to date, but it is going up against incumbents like OpenAI and Google.