Cloudflare

Code Orange: Fail Small — Our resilience plan following recent incidents

Summary

The article outlines Cloudflare's 'Code Orange: Fail Small' initiative, which aims to make its network more resilient following significant outages. It describes the incidents that led to the plan and emphasizes that configuration changes need controlled rollouts, just as software updates do. The initiative focuses on improving failure modes between services, revising internal procedures to speed up emergency response, and bringing configuration management for all production systems under Health Mediated Deployments (HMD). With these strategies, Cloudflare aims to prevent similar incidents and restore user trust.
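To make the idea of a controlled, health-gated rollout concrete, here is a minimal sketch in Python. It is not Cloudflare's HMD implementation: the function name `staged_rollout`, the stage layout, and the `apply`, `revert`, and `healthy` callbacks are all hypothetical, chosen only to show a change advancing stage by stage and being rolled back as soon as a health check fails.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    apply: Callable[[str], None],
    revert: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> bool:
    """Apply a change one stage at a time (hypothetical sketch).

    After each stage, every target in that stage is health-checked.
    On the first failure, all targets touched so far are reverted in
    reverse order and the rollout stops. Returns True only if every
    stage was applied and stayed healthy.
    """
    touched: List[str] = []
    for stage in stages:
        for target in stage:
            apply(target)
            touched.append(target)
        if not all(healthy(target) for target in stage):
            # Roll back everything applied so far, newest first.
            for target in reversed(touched):
                revert(target)
            return False
    return True
```

With stages such as `[["canary"], ["region-a", "region-b"], ["global"]]`, a failure in `region-b` reverts the canary and both regions, and the `global` stage is never touched. That containment is the point of the sketch: a bad change fails in a small part of the fleet instead of everywhere at once.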

Key Learnings

  1. Controlled rollouts for configuration changes can mitigate risks associated with rapid deployments.
  2. Understanding and addressing failure modes between services is crucial for maintaining system reliability.
  3. Improving access to tools during emergencies can significantly reduce resolution times.
  4. Regular training and updates to break glass procedures are essential for effective incident management.
  5. The integration of Health Mediated Deployments (HMD) for configuration management will enhance the overall resilience of production systems.
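One way to read "failing small" at the boundary between services is that a consumer keeps running on its last known-good configuration when a new one is invalid, rather than crashing. The sketch below illustrates that pattern under stated assumptions: the `ConfigHolder` class, the `validate_features` check, and the `MAX_FEATURES` limit are hypothetical and are not taken from the article.

```python
from typing import Callable

MAX_FEATURES = 200  # hypothetical limit, for illustration only


def validate_features(cfg: dict) -> None:
    """Raise ValueError if the candidate config is malformed or too large."""
    features = cfg.get("features")
    if not isinstance(features, list):
        raise ValueError("'features' must be a list")
    if len(features) > MAX_FEATURES:
        raise ValueError("too many features")


class ConfigHolder:
    """Holds the active config and rejects invalid updates (hypothetical)."""

    def __init__(self, initial: dict, validate: Callable[[dict], None]) -> None:
        validate(initial)  # the starting config must itself be valid
        self._active = initial
        self._validate = validate
        self.rejected = 0  # count of rejected updates, e.g. for alerting

    @property
    def active(self) -> dict:
        return self._active

    def update(self, candidate: dict) -> bool:
        """Swap in the candidate only if it validates; otherwise keep the old one."""
        try:
            self._validate(candidate)
        except ValueError:
            self.rejected += 1
            return False
        self._active = candidate
        return True
```

A caller that receives `False` from `update` continues serving with `holder.active` unchanged and can raise an alert from the `rejected` counter. Whether to keep the old configuration, fall back to defaults, or fail closed is a per-service design decision; this sketch shows only the first option.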

Who Should Read This

Senior Site Reliability Engineers focusing on incident management and resilience strategies in cloud infrastructure.

Test Your Knowledge

  • What are the trade-offs of implementing controlled rollouts for configuration changes compared to immediate global deployments?
  • How can the failure modes identified in the incidents inform future design decisions in Cloudflare's architecture?
  • What specific changes to break glass procedures could expedite incident resolution without compromising security?
  • In what ways can the Health Mediated Deployment (HMD) system be adapted for configuration management?
  • How does the concept of resilience engineering apply to the challenges faced during the recent outages?

Read Full Article at Cloudflare