TLDR DevOps 2026-04-29
GitHub Availability 📉, Cloud Cost Optimization ☁️, Autonomy Problem ✨
Bolting AI onto old service management playbooks? Time to shatter the service quo (Sponsor)
Service management workflows were designed for the 2010s. With AI, work happens in real-time across code deploys, SaaS apps, devices, and collaboration tools. So why is service still trapped in a queue?
In this whitepaper, Atlassian shares its AI-native vision for service management.
Topics include:
- Why old service management playbooks are failing in the AI era and forcing teams to reimagine experiences from the ground up.
- How Rovo and Teamwork Graph unlock smarter, context-aware AI and better, more proactive service experiences.
- How Service Collection can free your team from the relics of legacy service desks and help you shatter the service quo.
Read the whitepaper
How we built the most performant DeepSeek V3.2, MiniMax-M2.5 and Qwen 3.5 397B on DigitalOcean NVIDIA HGX™ B300 GPU Droplets (5 minute read)
DigitalOcean announced general availability of DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B on its Serverless Inference platform, which achieved the fastest output speeds among all tested providers, with DeepSeek V3.2 delivering 230 tokens per second and sub-1-second time to first token for 10,000 input tokens. The performance came from NVIDIA's HGX B300 GPUs with 288GB memory, NVFP4 quantization for a 1.8x smaller memory footprint, and custom optimizations to the vLLM serving framework made in collaboration with Inferact.
Kubernetes v1.36: Mutable Pod Resources for Suspended Jobs (beta) (3 minute read)
Kubernetes v1.36 promoted to beta the ability to modify CPU, memory, GPU, and other resource requests in suspended Jobs' pod templates, eliminating the need to delete and recreate Jobs when resource requirements change. The feature, enabled by default, lets queue controllers and administrators adjust resources before Jobs start running. It is particularly useful for batch and machine learning workloads where optimal allocation depends on current cluster conditions.
An update on GitHub availability (6 minute read)
GitHub says recent outages were caused by rapid growth in AI-driven development, which has pushed the platform beyond its current scaling limits. The company is prioritizing reliability by expanding capacity, isolating critical systems, and reducing single points of failure to handle the surge.
The Autonomy Problem: Why AI Agents Demand a New Security Playbook (4 minute read)
AI agents automate development and business tasks but introduce new risks such as prompt injection, privilege escalation, and cascading failures. These risks expand attack surfaces and have drawn concern from NIST. Effective mitigation requires layered controls across model design, system permissions, and human oversight to ensure secure deployment.
How it feels to run an incident with AI SRE (8 minute read)
This post describes incident.io's evolving AI SRE experience, which automates incident investigation, debugging, and resolution within a unified workflow. It reduces context switching by integrating Slack, coding tools, and updates, enabling rapid diagnosis, fixes, and reporting with minimal manual effort.
How GitHub uses eBPF to improve deployment safety (7 minute read)
GitHub mitigates circular deployment dependencies, where outages could block their own recovery, by using eBPF to monitor and restrict deployment scripts' network access and detect hidden, direct, and transient dependencies. This enables per-process control, DNS interception, and real-time auditing of risky calls like GitHub API usage during incident recovery.
Kubernetes for platform teams: Leveraging k0s and k0rdent (6 minute read)
This post demonstrates how to build a scalable, multi-cluster Kubernetes platform on OpenStack using k0s, k0rdent, and Hosted Control Planes (HCP), which eliminates the need for dedicated 3-node control planes per cluster by centralizing them in a single management cluster. The architecture shifts from managing individual clusters to operating a declarative system that handles provisioning, scaling, and upgrades across entire fleets while significantly reducing infrastructure costs and operational complexity.
How do you move code safely from one environment to the next? (Sponsor)
Deployment ≠ promotion. This blog, from the creators of ArgoCD, explains why promotion is the missing layer in GitOps stacks. Learn how healthy GitOps uses continuous promotion to govern movements between environments.
Read the blog
From air-gapped to private cloud: Security that adapts to your environment (3 minute read)
Cloud-native security must adapt to diverse deployment constraints rather than enforce SaaS models, and Sysdig Secure delivers consistent runtime detection across private cloud, on-premises, and air-gapped environments with flexible, locally controlled implementations.
Ghostty Is Leaving GitHub (3 minute read)
Mitchell Hashimoto, cofounder of HashiCorp, has announced that he is moving the Ghostty project off GitHub after 18 years of deep personal and professional attachment, citing growing frustration and disappointment with the platform.
Cloud Cost Optimization: Principles that still matter (5 minute read)
Cloud cost optimization is a continuous, strategic practice of aligning usage with business value, made more critical by unpredictable, resource-intensive AI workloads that require strong visibility, governance, and iterative management.