Cloudflare

Cloudflare outage on November 18, 2025

Summary

On November 18, 2025, Cloudflare experienced a major outage caused by a misconfiguration in its database permissions. The permissions change altered the results of a ClickHouse query, which began returning duplicate feature rows and produced an oversized feature file for the Bot Management system. The result was HTTP 5xx errors across multiple services, including the core CDN and security services, Workers KV, and Access. The incident was exacerbated by the propagation of the bad configuration file across the network and by an initial misdiagnosis of the failures as a DDoS attack. The article outlines the timeline of events, the technical failures behind the outage, and the recovery efforts, and it emphasizes the importance of robust configuration management and monitoring in cloud services.
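
The failure mode in the summary, where a permissions change makes a metadata query return more rows than before, can be illustrated with a minimal sketch. It is a simulation in plain Python, not Cloudflare's query or schema: the `ColumnRow` type, the database names `default` and `underlying`, the table name `features`, and the `feature_names` helper are all hypothetical.

```python
from typing import NamedTuple, Optional


class ColumnRow(NamedTuple):
    """One row of a hypothetical column-metadata table."""
    database: str
    table: str
    name: str


# Hypothetical metadata: the same feature columns exist in two databases.
ROWS = [
    ColumnRow("default", "features", "f1"),
    ColumnRow("default", "features", "f2"),
    ColumnRow("underlying", "features", "f1"),
    ColumnRow("underlying", "features", "f2"),
]


def feature_names(rows, visible_databases, database_filter: Optional[str] = None):
    """Return feature column names the caller is permitted to see.

    Without database_filter, the result silently grows whenever the caller
    gains access to another database that holds the same table.
    """
    return [
        r.name
        for r in rows
        if r.database in visible_databases
        and r.table == "features"
        and (database_filter is None or r.database == database_filter)
    ]


before = feature_names(ROWS, {"default"})
after = feature_names(ROWS, {"default", "underlying"})
pinned = feature_names(ROWS, {"default", "underlying"}, database_filter="default")
```

In this sketch, `after` holds every feature twice once the second database becomes visible, while `pinned` stays identical to `before` because the query names the database explicitly.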

Key Learnings

  • Understanding how database permission changes can inadvertently affect application behavior and lead to outages.
  • The critical role of configuration files in machine learning models and the potential for cascading failures when these files are not managed correctly.
  • The importance of having a clear incident response plan and the need for accurate diagnostics to avoid misattributing outages to external threats.
  • The necessity of implementing limits on resource usage to prevent system failures due to unexpected input sizes.
  • The value of post-incident analysis in identifying weaknesses in system design and improving resilience against future outages.
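
The learning about resource limits can be sketched as a loader that validates a candidate feature list against a limit and keeps the last known good configuration when validation fails, rather than crashing. This is an illustrative sketch under stated assumptions, not Cloudflare's implementation: `FeatureFileError`, `validate_features`, `load_or_keep`, and the limit value used in the usage lines are all hypothetical.

```python
import logging


class FeatureFileError(ValueError):
    """Raised when a candidate feature list fails validation."""


def validate_features(features, limit):
    """Reject feature lists that contain duplicates or exceed the limit."""
    if len(set(features)) != len(features):
        raise FeatureFileError("feature list contains duplicate entries")
    if len(features) > limit:
        raise FeatureFileError(
            f"feature list has {len(features)} entries, limit is {limit}"
        )
    return list(features)


def load_or_keep(current, candidate, limit):
    """Return the validated candidate, or the current list if it is invalid."""
    try:
        return validate_features(candidate, limit)
    except FeatureFileError as exc:
        # Fail safe: log the rejection and keep serving the last good config.
        logging.warning("rejected new feature file: %s", exc)
        return current


good = ["a", "b"]
accepted = load_or_keep(good, ["a", "b", "c"], limit=4)
kept_on_duplicates = load_or_keep(good, ["a", "b", "a", "b"], limit=4)
kept_on_oversize = load_or_keep(good, ["a", "b", "c", "d", "e"], limit=4)
```

The design choice here is that an invalid input is treated as a rejected update rather than a fatal error, so a bad file cannot take down the process that consumes it.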

Who Should Read This

Senior cloud engineers who manage large-scale distributed systems, and incident response teams.

Test Your Knowledge

  • What design decisions contributed to the propagation of the oversized feature file across Cloudflare's network?
  • How did the change in ClickHouse query behavior lead to the generation of duplicate feature rows?
  • What steps can be taken to prevent similar outages caused by configuration file issues in the future?
  • In what ways did the initial misdiagnosis of a DDoS attack affect the incident response and recovery process?
  • What are the implications of having a fixed-size limit on machine learning features in production systems?
