Engineering posts about Incident Management

Curated summaries and key learnings for engineers working with Incident Management.

Databricks
5m

How security teams can report cyber risk to boards

The article outlines the importance of translating cyber risk into financial terms to enable boards to make informed decisions regarding security investments. It emphasizes the need for coherent risk...

Duolingo
5m

Triage, ship, debug—all from Slack

Duolingo has developed an AI-powered Slack app that integrates with tools such as GitHub, Jenkins, and AWS to enhance developer productivity and streamline incident management. The app...

Databricks
4m

Why AI Security Infrastructure is Now a CMO Priority

The article emphasizes the critical role of AI security infrastructure in modern enterprises, particularly highlighting the launch of Databricks Lakewatch, an innovative security information and...

Cloudflare
11m

When DNSSEC goes wrong: how we responded to the .de TLD outage

The article discusses the DNSSEC outage affecting the .de TLD on May 5, 2026, when DENIC published incorrect DNSSEC signatures, leading to widespread SERVFAIL responses from validating resolvers. It...

Cloudflare
11m

Code Orange: Fail Small is complete. The result is a stronger Cloudflare network

The article outlines the completion of Cloudflare's 'Code Orange: Fail Small' initiative, aimed at enhancing the resilience and reliability of its network infrastructure. Key improvements include the...

Databricks
4m

Alert Fatigue Is a Business Risk

The article highlights the critical issue of alert fatigue in enterprise security operations, where the overwhelming volume of alerts leads to significant risks as analysts struggle to prioritize and...

DigitalOcean
13m

From Incident Counting to SLIs: How DigitalOcean Rethought Availability

The article discusses DigitalOcean's transition from an incident-counting methodology to a more nuanced SLI-based approach for measuring availability. Initially, the company relied on a simplistic...

Meta (Facebook)
1m

Trust But Canary: Configuration Safety at Scale

In this Meta Tech Podcast episode featuring Pascal Hartig, the discussion covers the strategies Meta's Configurations team uses to keep configuration rollouts safe at scale. The...

Cloudflare
8m

A one-line Kubernetes fix that saved 600 hours a year

The article discusses a critical performance issue encountered in Kubernetes while running Atlantis, a tool for managing Terraform changes. The problem stemmed from slow restarts due to a default behavior...

Databricks
8m

Databricks Announces Lakewatch: New Open, Agentic SIEM

Databricks has introduced Lakewatch, an innovative open security information and event management (SIEM) solution designed to address the limitations of traditional SIEMs, particularly in the context...

Airbnb
12m

From vendors to vanguard: Airbnb’s hard-won lessons in observability ownership

The article outlines Airbnb's transition from a vendor-managed observability platform to a custom in-house solution built on open-source technology, specifically Prometheus. It details the challenges...

DigitalOcean
8m

Scaling Autonomous Site Reliability Engineering: Architecture, Orchestration, and Validation for a 90,000+ Server Fleet

The article discusses the implementation of an AI-powered Site Reliability Engineer (SRE) agent at Cloudways, which manages a fleet of over 90,000 servers. It outlines the architecture involving an...

Cloudflare
12m

Building a security overview dashboard for actionable insights

The article presents a comprehensive overview of a newly developed security dashboard designed to enhance the efficiency of security teams by providing actionable insights rather than mere...

Cloudflare
12m

Investigating multi-vector attacks in Log Explorer

The article discusses the complexities of modern multi-vector attacks in cybersecurity, emphasizing the necessity for comprehensive visibility through tools like Cloudflare Log Explorer. It outlines...

Cloudflare
15m

Cloudflare outage on February 20, 2026

On February 20, 2026, Cloudflare experienced a significant outage affecting customers using its Bring Your Own IP (BYOIP) service due to a misconfiguration in the Border Gateway Protocol (BGP)...

Cloudflare
9m

2025 Q4 DDoS threat report: A record-setting 31.4 Tbps attack caps a year of massive DDoS assaults

The 2025 Q4 DDoS threat report by Cloudflare reveals a significant escalation in DDoS attacks, with a record-setting attack of 31.4 Tbps marking a year of unprecedented assaults. The report...

Cloudflare
9m

Route leak incident on January 22, 2026

On January 22, 2026, a misconfiguration in Cloudflare's routing policy led to a significant BGP route leak, affecting both Cloudflare customers and external networks. The incident, which lasted 25...

GitHub
6m

When protections outlive their purpose: A lesson on managing defense systems at scale

The article outlines the challenges faced by GitHub in managing defense mechanisms that protect the platform from abuse while ensuring legitimate users are not adversely affected. It highlights the...

Databricks
9m

Securing the Grid: A Practical Guide to Cyber Analytics for Energy & Utilities

The article outlines the critical cybersecurity challenges faced by the Energy & Utilities sector, particularly due to the convergence of IT and operational technology (OT) systems. It emphasizes the...

Cloudflare
11m

Code Orange: Fail Small — Our resilience plan following recent incidents

The article outlines Cloudflare's 'Code Orange: Fail Small' initiative aimed at enhancing the resilience of its network following significant outages. It details the incidents that led to the plan,...