TLDR AI 2025-08-08
OpenAI GPT-5 5️⃣, Cursor CLI 💻, Grok ads 📰
METR's Evaluation of GPT-5 (44 minute read)
METR assessed whether OpenAI's GPT-5 could pose catastrophic risks before it was externally deployed. This post provides detailed findings from that assessment. METR determined that GPT-5 falls short, by a large margin, of the prerequisite capabilities for posing catastrophic risk.
Vibe Check on GPT-5 (13 minute read)
GPT-5 excels as a daily driver for most users, with API pricing that aggressively undercuts competitors by up to 12x. However, it is too cautious for writing feedback or for autonomous coding workflows, making it feel like a significant upgrade to an old paradigm rather than a leap forward for developers already using multi-agent tools like Claude Code.
GPT-5 Hands-On: Welcome to the Stone Age (3 minute read)
GPT-5 marks the "stone age" for AI. It doesn't just use tools - it thinks with them, much as early humans learning to shape stones changed everything. In testing, it one-shotted dependency conflicts that stumped every other model, using yarn commands the way Deep Research uses web search: iterating and reasoning through problems instead of just guessing. It is worse at writing than GPT-4.5 (producing more "LinkedIn slop"), but it is unequivocally the best coding model, creating entire production-ready websites with SQLite databases in one shot where other models gave only scaffolding or plans. That takes software engineering automation from maybe 65% to 72% in one leap.
👨‍💻
Engineering & Research
Momentic: AI that makes testing effortless (Sponsor)
YC-backed Momentic uses AI to automate web app testing. Write tests in plain English ("the login button should be visible") and let AI execute them. Join hyper-growth companies like Retool, Notion, Webflow, and hundreds of others using Momentic.
Octo (GitHub Repo)
Octo is a very friendly open-source coding helper. It works with any OpenAI- or Anthropic-compatible LLM API and lets developers switch models at will mid-conversation when a particular model gets stuck. Users can optionally use custom-trained models to automatically handle tool call and code edit failures from the main models. Octo has zero telemetry.
Notte (GitHub Repo)
Notte is a web agent framework built for speed, cost-efficiency, scale, and reliability. It allows developers to rapidly build reliable web automation agents. Notte provides all the essential tools for building and deploying AI agents that interact seamlessly with the web. The full-stack framework combines AI agents with traditional scripting for maximum efficiency. It enables users to develop, deploy, and scale their own agents and web automations within a single API.
From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training (23 minute read)
Safe-completion training uses two reward components: a safety constraint that penalizes policy violations according to their severity, and a helpfulness term that rewards both direct compliance and informative refusals that offer safe alternatives. In tests, GPT-5 with safe-completions achieved higher safety scores than o3 on dual-use prompts while providing substantially more helpful responses, and when failures did occur, they were significantly less severe.
Achieving 10,000x training data reduction with high-fidelity labels (11 minute read)
Identifying policy-violating content requires deep contextual and cultural understanding, an area where large language models (LLMs) outperform traditional machine learning systems. However, fine-tuning models for such complex tasks requires high-fidelity training data that is difficult and expensive to curate at the necessary quality and scale. This post describes a scalable curation process for active learning that drastically reduces the amount of training data needed to fine-tune LLMs while significantly improving model alignment with human experts. In experiments, the process cut the training data needed from 100,000 examples to under 500 while increasing model alignment with human experts by up to 65%.