Code Orange: Fail Small — Our resilience plan following recent incidents

Summary

The article outlines Cloudflare's 'Code Orange: Fail Small' initiative, a plan to make its network more resilient after recent significant outages. It describes the incidents that prompted the plan and argues that configuration changes need controlled rollouts, just as software releases do. The initiative has three parts: improving how services fail when a dependency misbehaves, revising internal procedures so that emergency responses are faster, and requiring all production systems to use Health Mediated Deployments (HMD) for configuration management. With these changes, Cloudflare aims to prevent similar incidents and restore user trust.

Key Learnings

1. Controlled rollouts for configuration changes can mitigate risks associated with rapid deployments.
2. Understanding and addressing failure modes between services is crucial for maintaining system reliability.
3. Improving access to tools during emergencies can significantly reduce resolution times.
4. Regular training and updates to break glass procedures are essential for effective incident management.
5. The integration of Health Mediated Deployments (HMD) for configuration management will enhance the overall resilience of production systems.
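
The idea behind the first and last learnings, rolling a configuration change out in stages and letting health signals decide whether it proceeds, can be sketched in a few lines. This is a minimal illustration only, not Cloudflare's HMD implementation: the `Stage` and `Rollout` classes, the `is_healthy` probe, and the stage names are all hypothetical, and a real system would also wait for metrics to settle between stages.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Stage:
    """One rollout stage: a label and the hosts it covers (hypothetical)."""
    name: str
    hosts: List[str]


@dataclass
class Rollout:
    stages: List[Stage]
    # Hypothetical health probe: True if the host is healthy with this config.
    is_healthy: Callable[[str, dict], bool]
    # Config currently applied to each host that the rollout has touched.
    applied: Dict[str, dict] = field(default_factory=dict)

    def run(self, new_config: dict, old_config: dict) -> str:
        """Apply new_config stage by stage; revert all touched hosts on failure."""
        touched: List[str] = []
        for stage in self.stages:
            for host in stage.hosts:
                self.applied[host] = new_config
                touched.append(host)
            if not all(self.is_healthy(h, new_config) for h in stage.hosts):
                # Health check failed: restore the last known-good config
                # on every host changed so far and stop the rollout.
                for host in touched:
                    self.applied[host] = old_config
                return f"rolled back at stage {stage.name}"
        return "completed"


# Assumed example: a bad config is caught at the smallest stage.
stages = [
    Stage("canary", ["h1"]),
    Stage("region", ["h2", "h3"]),
    Stage("global", ["h4", "h5", "h6"]),
]
rollout = Rollout(stages, is_healthy=lambda host, cfg: not cfg.get("broken", False))
result = rollout.run({"version": 2, "broken": True}, {"version": 1})
```

In this sketch the broken configuration reaches only the single canary host before the rollout stops and reverts, which is the "fail small" property: the blast radius of a bad change is bounded by the size of the stage that detects it.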

Who Should Read This

Senior Site Reliability Engineers focusing on incident management and resilience strategies in cloud infrastructure.

Test Your Knowledge

What are the trade-offs of implementing controlled rollouts for configuration changes compared to immediate global deployments?

How can the failure modes identified in the incidents inform future design decisions in Cloudflare's architecture?

What specific changes to break glass procedures could expedite incident resolution without compromising security?

In what ways can the Health Mediated Deployment (HMD) system be adapted for configuration management?

How does the concept of resilience engineering apply to the challenges faced during the recent outages?

Topics

Incident Management, Resilience Engineering, Failure Modes, Service Level Objectives

