Engineering posts about Service Level Objectives
Curated summaries and key learnings for engineers working with Service Level Objectives.
Code Orange: Fail Small is complete. The result is a stronger Cloudflare network
The article outlines the completion of Cloudflare's 'Code Orange: Fail Small' initiative, aimed at enhancing the resilience and reliability of its network infrastructure. Key improvements include the...
From Incident Counting to SLIs: How DigitalOcean Rethought Availability
The article describes DigitalOcean's move from counting incidents to measuring availability with service level indicators (SLIs). Initially, the company relied on a simplistic...
A one-line Kubernetes fix that saved 600 hours a year
The article discusses a critical performance issue with Atlantis, a tool for managing Terraform changes, running on Kubernetes. The problem stemmed from slow restarts due to a default behavior...
When protections outlive their purpose: A lesson on managing defense systems at scale
The article outlines the challenges GitHub faces in managing defense mechanisms that protect the platform from abuse without adversely affecting legitimate users. It highlights the...
Code Orange: Fail Small — Our resilience plan following recent incidents
The article outlines Cloudflare's 'Code Orange: Fail Small' initiative aimed at enhancing the resilience of its network following significant outages. It details the incidents that led to the plan,...
Pull request intervention for infrastructure-as-code risks with Bitbucket custom merge checks
The article discusses Atlassian's approach to mitigating infrastructure-as-code risks with Bitbucket custom merge checks. It highlights the importance of...
Cloudflare outage on December 5, 2025
On December 5, 2025, Cloudflare experienced a significant outage affecting a portion of its network due to a configuration change related to its Web Application Firewall (WAF). The incident, which...