TLDR AI 2025-12-23
Year with ChatGPT, OpenTinker, GLM 4.7
Delve Shipmas Day 2: AI Vendor Risk Management (Sponsor)
Welcome to Delve Shipmas. Every day this week, Delve is launching a new AI feature to make compliance more modern than ever. Yesterday, we launched Delve's AI Copilot, which can help with any compliance task or question. Today, we're launching... AI Vendor Risk Management. Vendor risk management is painful. Dozens of vendors, endless questionnaires, and unclear risk levels. Delve's AI VRM handles this automatically by auto-gathering security evidence from each vendor, creating findings, and giving you a clear view of your risk posture. Leave spreadsheets and legacy platforms in the past and enter the future of risk with Delve AI. Join companies like 11x, micro1, Bland, Wisprflow and more.
Book a demo to see Delve in action and get $1,500 off with code DELVEXMAS2.
Z.AI launches GLM-4.7, new SOTA open-source model for coding (2 minute read)
GLM-4.7 is the latest release in Z.AI's General Language Model line. The high-end foundation model is aimed at advanced reasoning, coding, and multimodal workloads. The latest update expands context handling and reasoning depth compared to earlier versions. It introduces upgraded reasoning pipelines and broader multimodal support.
Introducing Manus Design View (3 minute read)
Manus Design View is an extension of the Manus agent for seamless AI design workflows. It enables designers to generate concepts with simple prompts, make precise edits using the Mark Tool, and change text easily. The Manus mobile app allows design edits on-the-go with either text or voice input. It is available to all users today.
Your Year with ChatGPT (2 minute read)
OpenAI introduced a personalized year-in-review feature called "Your Year with ChatGPT," available to eligible users in select regions. Inspired by Spotify Wrapped, the feature highlights individual usage trends from the past year.
MiniMax M2.1 is live in Kilo (3 minute read)
MiniMax M2.1 is ahead of DeepSeek and Kimi on several benchmarks. It is even catching up to state-of-the-art models in some areas. The model is super-fast and efficient. It is now available to all Kilo Code users.
Deep Dives & Analysis
Hardening Atlas Against Prompt Injection (13 minute read)
This post details OpenAI's ongoing efforts to secure its AI browser, Atlas, against prompt injection attacks, in which malicious instructions embedded in web content manipulate agent behavior. While mitigation techniques are improving, the company acknowledged that prompt injection remains an unsolved and persistent threat, particularly as agent capabilities expand on the open web.
We removed 80% of our agent's tools (4 minute read)
Vercel spent months building a sophisticated internal text-to-SQL agent with specialized tools, heavy prompt engineering, and careful context management. It kind of worked, but it was fragile, slow, and required constant maintenance. The team then deleted most of it and stripped the agent down to a single tool that executed arbitrary bash commands. Its agent got simpler and better at the same time: it had a 100% success rate instead of 80%.
What (I think) makes Gemini 3 Flash so good and fast (8 minute read)
Gemini 3 Flash is a lightweight, efficient model optimized for speed and low latency. It is capable of delivering performance comparable to Gemini 3 Pro at a fraction of the cost. The model's design brings unprecedented power but introduces specific tradeoffs in token efficiency and reliability. This post takes a look at the leaked architectural details of the new model.
Async Coding Agents "From Scratch" (10 minute read)
It's pretty easy to homebrew your own asynchronous coding agent. This means that businesses selling coding agents can no longer differentiate themselves by only running sandboxed agents in the cloud that connect to Slack. Companies working on coding agents likely realize this and are doing everything they can to train their own SWE agents and auxiliary models to improve their harnesses.
Engineering & Research
LLM code quality leaderboard: How does your preferred model score? (Sponsor)
Interested to learn how different LLMs perform for coding?
New research on models like GPT-5.2 High and Gemini 3.0 Pro reveals trade-offs in structural quality and security.
Learn more about the reliability, security, and maintainability of code written by the latest models with Sonar's LLM Leaderboard, the definitive resource to understand the true quality of AI-generated code.
Scientific Intelligence Benchmark (GitHub Repo)
SGI-Bench is a benchmark for assessing Scientific General Intelligence across the entire research cycle: Deliberation, Conception, Action, and Perception. It spans 10 disciplines with over 1,000 expert-curated tasks inspired by major open scientific questions.
Agent Skills for Context Engineering (GitHub Repo)
This repository contains a comprehensive collection of Agent Skills for building production-grade AI agent systems. They are categorized into Foundational skills, Architectural skills, and Operational skills. Each skill is structured for efficient context use. The patterns work on any agent platform that supports skills or allows custom instructions.
OpenTinker (GitHub Repo)
OpenTinker is an RL-as-a-Service infrastructure for foundation models. It features separation of programming and execution, separation of environment and training code, and seamless transition from training to inference. The platform enables users to perform RL training and inference without requiring local GPU resources by separating client-side programming from server-side execution. It provides a high-level Python API that abstracts away the complexity of distributed systems.
Cursor Expands Agent Hooks (3 minute read)
Cursor has announced partnerships to integrate its agent hook system with security and platform vendors. These hooks allow organizations to observe, modify, or block stages of the agent loop, supporting use cases like governance, dependency scanning, secrets management, and agent safety.
The Shape of Artificial Intelligence (33 minute read)
AI's utility in the coming decade will come from understanding the technology's strengths and where it can be used to augment human ability. It won't replace humans, at least in the short term, because we are too complex. However, the technology will eventually conquer territories we thought were exclusively ours. This will be the first time we will face true otherness, a new species of being.