Code Orange: Fail Small — Our resilience plan following recent incidents
Summary
The article outlines Cloudflare's 'Code Orange: Fail Small' initiative, a plan to make its network more resilient after significant outages. It describes the incidents that prompted the plan and argues that configuration changes should be rolled out in controlled stages, the same way software updates are. The initiative has three focus areas: improving failure modes between services, revising internal procedures so that emergency response is faster, and requiring all production systems to use Health Mediated Deployments (HMD) for configuration management. With these changes, Cloudflare aims to prevent similar incidents and restore user trust.
Key Learnings
- Controlled rollouts for configuration changes can mitigate risks associated with rapid deployments.
- Understanding and addressing failure modes between services is crucial for maintaining system reliability.
- Improving access to tools during emergencies can significantly reduce resolution times.
- Regular training and updates to break glass procedures are essential for effective incident management.
- The integration of Health Mediated Deployments (HMD) for configuration management will enhance the overall resilience of production systems.
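To make the idea of a controlled, health-gated configuration rollout concrete, here is a minimal sketch in Python. It is not Cloudflare's HMD implementation, which this summary does not describe in detail. The stage names, the `staged_rollout` function, and the `apply`, `is_healthy`, and `rollback` callbacks are all hypothetical, chosen only to illustrate the general pattern: apply a change to a small slice, check health, and either continue or roll back.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class RolloutResult:
    succeeded: bool
    applied_stages: List[str] = field(default_factory=list)
    failed_stage: Optional[str] = None


def staged_rollout(
    stages: Sequence[str],
    apply: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> RolloutResult:
    """Apply a change stage by stage; stop and roll back on the first unhealthy stage."""
    applied: List[str] = []
    for stage in stages:
        apply(stage)
        applied.append(stage)
        if not is_healthy(stage):
            # Undo in reverse order so the most recently changed stage is restored first.
            for done in reversed(applied):
                rollback(done)
            return RolloutResult(succeeded=False, failed_stage=stage)
    return RolloutResult(succeeded=True, applied_stages=applied)


# Hypothetical usage: an in-memory "fleet" where the new config is unhealthy in region-a.
live_config = {"canary": "v1", "region-a": "v1", "global": "v1"}

result = staged_rollout(
    stages=["canary", "region-a", "global"],
    apply=lambda s: live_config.__setitem__(s, "v2"),
    is_healthy=lambda s: s != "region-a",
    rollback=lambda s: live_config.__setitem__(s, "v1"),
)
print(result.succeeded, result.failed_stage, live_config)
```

In this example the rollout stops at the hypothetical `region-a` stage, so the `global` stage never receives the new configuration and the earlier stages are restored to `v1`. The trade-off is slower propagation in exchange for a smaller blast radius when a change turns out to be bad.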
Who Should Read This
Senior Site Reliability Engineers focusing on incident management and resilience strategies in cloud infrastructure.
Test Your Knowledge
What are the trade-offs of implementing controlled rollouts for configuration changes compared to immediate global deployments?
How can the failure modes identified in the incidents inform future design decisions in Cloudflare's architecture?
What specific changes to break glass procedures could expedite incident resolution without compromising security?
In what ways can the Health Mediated Deployment (HMD) system be adapted for configuration management?
How does the concept of resilience engineering apply to the challenges faced during the recent outages?