Engineering posts about Resilience Engineering
Curated summaries and key learnings for engineers working with Resilience Engineering.
Code Orange: Fail Small is complete. The result is a stronger Cloudflare network
The article outlines the completion of Cloudflare's 'Code Orange: Fail Small' initiative, aimed at enhancing the resilience and reliability of its network infrastructure. Key improvements include the...
From Incident Counting to SLIs: How DigitalOcean Rethought Availability
The article discusses DigitalOcean's transition from counting incidents to measuring availability with service level indicators (SLIs). Initially, the company relied on a simplistic...
A one-line Kubernetes fix that saved 600 hours a year
The article discusses a critical performance issue encountered when running Atlantis, a tool for managing Terraform changes, on Kubernetes. The problem stemmed from slow restarts due to a default behavior...
Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters
The article discusses the implementation of backend aggregation (BAG) in Meta's Prometheus AI clusters, highlighting its role in interconnecting thousands of GPUs across multiple data centers. BAG...
When protections outlive their purpose: A lesson on managing defense systems at scale
The article outlines the challenges GitHub faces in managing defense mechanisms that protect the platform from abuse without adversely affecting legitimate users. It highlights the...
Code Orange: Fail Small — Our resilience plan following recent incidents
The article outlines Cloudflare's 'Code Orange: Fail Small' initiative aimed at enhancing the resilience of its network following significant outages. It details the incidents that led to the plan,...
Welcoming Stately Cloud to Databricks: Investing in the Foundation for Scalable AI Applications
The article highlights Databricks' acquisition of Stately Cloud, emphasizing the importance of building a robust foundation for scalable AI applications. It discusses the expertise of the Stately...
Pull request intervention for infrastructure-as-code risks with Bitbucket custom merge checks
The article discusses Atlassian's approach to mitigating infrastructure-as-code risks with Bitbucket custom merge checks. It highlights the importance of...