TLDR DevOps 2026-05-11
Maintaining AI Code 🔮, Idempotency in Distributed Systems 🧱, AgentMemory 🧠
Why dev teams outgrow their first CI (Sponsor)
Most engineering teams start on GitHub Actions or Jenkins.
Then the monorepo gets real, agents 5-10-50x the commit volume... and a flaky test sparks a Friday-afternoon outage.
Shopify, Pinterest, Block, Airbnb, OpenAI and Canva all run their CI on Buildkite.
We're built for the teams that need control over what runs where.
Try the lot for 30 days, zero commitment, no credit card. There's a real engineer named Ola on standby if you get stuck.
See what's included →
Introducing ARFBench: A time series question-answering benchmark based on real incidents (7 minute read)
Datadog introduced ARFBench, a benchmark built from real incidents for evaluating AI on time series reasoning. Current models lag experts, a hybrid TSFM-VLM improves performance, and combined model-expert approaches achieve near-superhuman results.
Kubernetes v1.36: Moving Volume Group Snapshots to GA (3 minute read)
Kubernetes v1.36 brought volume group snapshots to General Availability, enabling users to take crash-consistent snapshots of multiple volumes at the same point in time without requiring application quiescence. The feature, which progressed from Alpha in v1.27 to GA in v1.36, works exclusively with CSI volume drivers and uses label selectors to group PersistentVolumeClaim objects for coordinated snapshotting and restoration.
How Vault Secrets Operator (VSO) automates secret management for enterprises on Kubernetes (9 minute read)
Native Kubernetes secrets lack enterprise lifecycle management, leading organizations to adopt centralized solutions like Vault. The Vault Secrets Operator is now the recommended Kubernetes-native approach for automated, scalable, and secure secret delivery, rotation, and governance across environments.
Idempotency Is Easy Until the Second Request Is Different (25 minute read)
Idempotency is not solved by simply storing an idempotency key. The hard cases start when retries arrive concurrently, after partial failures, after downstream side effects, or with the same key but different request content. A robust design must remember the scoped operation, canonical command, execution state, replay contract, expiry policy, and recovery path so the server can replay, reject, or reconcile instead of accidentally duplicating side effects.
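As a rough illustration of the replay-or-reject idea, here is a minimal in-memory sketch. It is not taken from the linked article: the names `IdempotencyStore`, `IdempotencyConflict`, and `execute` are hypothetical, and the sketch leaves out expiry, persistence, and locking, all of which a real design would need.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Raised when a key is reused with different request content."""


class IdempotencyStore:
    """Hypothetical in-memory store; not thread-safe and without expiry."""

    def __init__(self):
        # (scope, key) -> {"fingerprint", "state", "response"}
        self._records = {}

    @staticmethod
    def fingerprint(request: dict) -> str:
        # Canonicalize the command so field order does not change the hash.
        canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def execute(self, scope: str, key: str, request: dict, handler):
        record_id = (scope, key)
        fp = self.fingerprint(request)
        record = self._records.get(record_id)

        if record is not None:
            if record["fingerprint"] != fp:
                # Same key, different content: reject rather than guess.
                raise IdempotencyConflict(f"key {key!r} reused with a different request")
            if record["state"] == "completed":
                # Same key, same content, finished: replay the stored response.
                return record["response"]
            # Still in progress, or a previous attempt died mid-flight:
            # this needs reconciliation, not a blind re-run.
            raise RuntimeError(f"key {key!r} is {record['state']}; reconcile before retrying")

        # First sighting: record intent before performing the side effect.
        self._records[record_id] = {"fingerprint": fp, "state": "in_progress", "response": None}
        response = handler(request)
        self._records[record_id].update(state="completed", response=response)
        return response
```

A caller would pass a tenant or account identifier as `scope`, so the same key from two different clients does not collide. If `handler` raises, the record stays `in_progress`, which is the case where the summary's "recovery path" has to take over.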
You Need AI That Reduces Maintenance Costs (7 minute read)
AI coding agents only create lasting productivity gains if they reduce maintenance costs in proportion to how much faster they help teams produce code. Otherwise, the speed boost is temporary while the added maintenance burden compounds over time, eventually leaving teams worse off than before.
Scaling data and AI with Managed Service for Apache Airflow (4 minute read)
Managed Service for Apache Airflow introduces AI-driven orchestration enhancements, including Airflow 3.1 GA, embedded troubleshooting agents, declarative YAML-based pipeline automation, and an MCP server, enabling scalable, accessible, and efficient data pipeline management for AI and MLOps workloads.
How Discord Automates ScyllaDB Clusters at Scale (9 minute read)
Discord's Persistence Infrastructure team built the Scylla Control Plane (SCP), an automation framework that reduced the time to stand up a full production replica database cluster from 36 hours of manual work to under 2 hours of mostly hands-off operation. The tool uses a layered system of tasks, workflows, and jobs written in Rust with YAML configuration to automate complex database operations across hundreds of ScyllaDB nodes, including automatic retries, state tracking via SQLite, and intelligent error handling that distinguishes between recoverable issues and critical failures requiring human intervention.
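The retry-and-escalate pattern described here can be sketched roughly as follows. This is not Discord's code: SCP is written in Rust, Python is used here only for brevity, and `run_task`, `RecoverableError`, `CriticalError`, and the `task_state` table are all hypothetical names.

```python
import sqlite3


class RecoverableError(Exception):
    """Transient failure that is safe to retry."""


class CriticalError(Exception):
    """Failure that should stop automation and page a human."""


def _save(db, name, status, attempts):
    # Persist the latest state so a restarted runner can see where it left off.
    db.execute(
        "INSERT OR REPLACE INTO task_state (name, status, attempts) VALUES (?, ?, ?)",
        (name, status, attempts),
    )
    db.commit()


def run_task(db, name, task, max_attempts=3):
    """Run `task`, retrying recoverable errors and escalating everything else."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS task_state "
        "(name TEXT PRIMARY KEY, status TEXT, attempts INTEGER)"
    )
    for attempt in range(1, max_attempts + 1):
        try:
            task()
        except RecoverableError:
            _save(db, name, "retrying", attempt)
            continue
        except CriticalError:
            _save(db, name, "needs_human", attempt)
            raise
        _save(db, name, "done", attempt)
        return attempt
    # Retries exhausted: treat as critical so a human takes a look.
    _save(db, name, "needs_human", max_attempts)
    raise CriticalError(f"{name} failed after {max_attempts} attempts")
```

Calling `run_task(sqlite3.connect(":memory:"), "restart-node", some_callable)` would run the callable and record its outcome; a workflow in this sketch would just be an ordered list of such calls.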
Bootstrapping Flux with Terraform, the right way (5 minute read)
A new Terraform module enables bootstrapping the Flux operator on Kubernetes, then cleanly hands ownership to Flux for ongoing reconciliation, avoiding Terraform drift conflicts. It supports secure secret handling outside state, single-repo GitOps workflows, and ordered cluster bootstrapping, including prerequisites like CNI, before full Flux control.