Cloudflare outage on November 18, 2025
Summary
On November 18, 2025, Cloudflare experienced a major outage after a change to its database permissions altered the behavior of a ClickHouse query, which then returned duplicate feature rows and generated an oversized feature file for the Bot Management system. The file exceeded a fixed-size limit on machine learning features, causing HTTP 5xx errors across core CDN and security services, Workers KV, and Access. The incident was made worse by the propagation of the bad configuration file across the network and by an initial misdiagnosis of the outage as a DDoS attack. The article outlines the timeline of events, the technical failures that led to the outage, and the subsequent recovery efforts, emphasizing the importance of robust configuration management and monitoring in cloud services.
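One way a permissions change can enlarge generated output without any change to the query text is sketched below. This is a simplified in-memory simulation, not Cloudflare's actual query or schema: the catalog, the database names `primary` and `shard`, the table name `bot_features`, and both helper functions are hypothetical. It shows that a metadata lookup that does not pin the database returns duplicate rows as soon as the account is granted visibility into a second database.

```python
from typing import NamedTuple


class ColumnRow(NamedTuple):
    database: str
    table: str
    name: str


# Hypothetical catalog: one table's columns are listed under two databases.
CATALOG = [
    ColumnRow(db, "bot_features", col)
    for db in ("primary", "shard")
    for col in ("feature_a", "feature_b", "feature_c")
]


def unfiltered_columns(catalog, granted, table):
    """Return column names for `table` from every database the account can see."""
    return [r.name for r in catalog if r.database in granted and r.table == table]


def filtered_columns(catalog, granted, table, database):
    """Same lookup, but pinned to one database so extra grants cannot add rows."""
    return [
        r.name
        for r in catalog
        if r.database in granted and r.database == database and r.table == table
    ]


# Before the permission change the account sees only one database.
before = unfiltered_columns(CATALOG, {"primary"}, "bot_features")
# After the change it sees both, so the unfiltered lookup doubles in size.
after = unfiltered_columns(CATALOG, {"primary", "shard"}, "bot_features")
# Pinning the database keeps the result stable regardless of grants.
pinned = filtered_columns(CATALOG, {"primary", "shard"}, "bot_features", "primary")
print(len(before), len(after), len(pinned))  # prints: 3 6 3
```

The design point is that the output of the unfiltered lookup depends on what the account is allowed to see, so an access-control change silently becomes a data change.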
Key Learnings
1. Understanding how database permission changes can inadvertently affect application behavior and lead to outages.
2. The critical role of configuration files in machine learning models and the potential for cascading failures when these files are not managed correctly.
3. The importance of having a clear incident response plan and the need for accurate diagnostics to avoid misattributing outages to external threats.
4. The necessity of implementing limits on resource usage to prevent system failures due to unexpected input sizes.
5. The value of post-incident analysis in identifying weaknesses in system design and improving resilience against future outages.
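The points above about resource limits and configuration handling can be illustrated with a small sketch. It is not Cloudflare's implementation: the `FeatureStore` class, the `validate_features` helper, and the `MAX_FEATURES` value of 200 are assumptions chosen for illustration. The idea is that a consumer validates each newly delivered feature list against its limit and keeps the last known good version when validation fails, rather than crashing on an oversized input.

```python
MAX_FEATURES = 200  # assumed limit, for illustration only


class FeatureFileError(ValueError):
    """Raised when a candidate feature list fails validation."""


def validate_features(names, max_features=MAX_FEATURES):
    """Return a copy of `names` if it is within the limit and free of duplicates."""
    if len(names) > max_features:
        raise FeatureFileError(
            f"{len(names)} features exceeds limit of {max_features}"
        )
    if len(set(names)) != len(names):
        raise FeatureFileError("duplicate feature names found")
    return list(names)


class FeatureStore:
    """Holds the active feature list and refuses to swap in an invalid one."""

    def __init__(self, initial):
        self.active = validate_features(initial)
        self.last_error = None

    def reload(self, candidate):
        """Try to activate `candidate`; keep the previous list on failure."""
        try:
            self.active = validate_features(candidate)
        except FeatureFileError as exc:
            self.last_error = str(exc)  # surface for alerting, keep serving
            return False
        self.last_error = None
        return True


store = FeatureStore(["a", "b"])
store.reload(["a", "b", "c"])      # accepted
store.reload(["a", "b", "c"] * 2)  # rejected: duplicates, old list stays active
print(store.active)                # prints: ['a', 'b', 'c']
```

Failing closed on the update while continuing to serve with the previous configuration trades freshness for availability, which is usually the safer default for files that are regenerated and distributed automatically.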
Who Should Read This
Senior cloud engineers who manage large-scale distributed systems, and incident response teams.
Test Your Knowledge
What design decisions contributed to the propagation of the oversized feature file across Cloudflare's network?
How did the change in ClickHouse query behavior lead to the generation of duplicate feature rows?
What steps can be taken to prevent similar outages caused by configuration file issues in the future?
In what ways did the initial misdiagnosis of a DDoS attack affect the incident response and recovery process?
What are the implications of having a fixed-size limit on machine learning features in production systems?