On 18 November 2025, a major disruption at Cloudflare, one of the world’s most important internet infrastructure companies, rippled across a large swathe of the web. Websites and services behind Cloudflare began returning “internal server error” pages and other 5xx HTTP status codes, temporarily taking down popular platforms such as ChatGPT, X (formerly Twitter), and many others.(Reuters)
This incident exposed just how much of the modern internet depends on a few critical providers, and highlighted the risks and trade-offs of centralized infrastructure. Here’s a detailed breakdown of what went wrong, how Cloudflare responded, who was affected, and what this means going forward.
What Happened: Timeline & Root Cause
1. The Outage Begins
- At 11:20 UTC, Cloudflare’s network began experiencing significant failures delivering core traffic.(The Cloudflare Blog)
- Users began seeing error pages when trying to access websites. These included 5xx status codes — a sign of server-side failures.(The Cloudflare Blog)
- Initially, some thought it might be a DDoS attack because of the sudden spike in “unusual traffic.”(The Guardian)
2. The Technical Root Cause
- The problem stemmed from a Bot Management system component at Cloudflare. That system uses a “feature file”: a configuration file with many “features” that the machine learning model uses to score traffic as bot or human.(The Cloudflare Blog)
- A change in Cloudflare’s ClickHouse database permissions caused a query to generate duplicate rows in this feature file.(The Cloudflare Blog)
- As a result, the feature file became much larger than expected — more than double its normal size.(The Cloudflare Blog)
- But the proxy software in Cloudflare’s network had a strict limit: it preallocated memory for roughly 200 features. The bloated file blew past that limit, the software panicked (an unrecoverable internal error), and the proxy began returning 5xx errors.(The Cloudflare Blog)
- Because this file is regenerated every five minutes, and only part of the database cluster was producing the faulty output, each cycle could propagate either a “good” (normal) or a “bad” (oversized) version, so the network flapped between recovery and failure.(The Cloudflare Blog)
- Eventually, all nodes in their network were serving the bad version, and the system stabilized in a failing state.(The Cloudflare Blog)
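The interaction between the hard feature cap and the bloated file can be sketched in a few lines. This is an illustrative Python model only: Cloudflare's proxy is not written in Python, and the names and the fail-safe variant here are assumptions; only the shape of the bug (a strict cap met by a duplicate-laden file) follows the post-mortem.

```python
# Illustrative model of the failure mode: a loader with a hard feature cap.
# All names (load_features, MAX_FEATURES) are hypothetical, not Cloudflare's code.

MAX_FEATURES = 200  # the proxy preallocated room for roughly this many features

def load_features(lines):
    """Parse a feature file, one feature per line; error out past the cap."""
    features = [line.strip() for line in lines if line.strip()]
    if len(features) > MAX_FEATURES:
        # The real proxy panicked at this point, turning requests into 5xx errors.
        raise RuntimeError(f"{len(features)} features exceeds cap {MAX_FEATURES}")
    return features

def load_features_failsafe(lines, last_good):
    """Fail-safe variant: keep serving the last known-good config instead of crashing."""
    try:
        return load_features(lines)
    except RuntimeError:
        return last_good
```

Duplicate rows from the database query are exactly what pushes such a file past its cap: each feature repeated once doubles the count without adding any information.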
3. The Fix
- At about 14:30 UTC, Cloudflare stopped propagating the faulty file.(The Cloudflare Blog)
- They manually restored a known-good version of the feature file and forced a restart of their core proxy systems.(The Cloudflare Blog)
- By 17:06 UTC, error rates had returned to normal and core traffic was flowing again.(The Cloudflare Blog)
- Some residual recovery work continued (for example, restarting services, clearing bad states) for a few more hours.(The Cloudflare Blog)
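The recovery sequence above follows a classic remediation pattern: stop consuming new versions, then pin the last version known to be good and restart against it. A minimal sketch of that pattern, with hypothetical names (Cloudflare's internal tooling is not public):

```python
# Minimal sketch of the "stop propagation, restore known-good" pattern.
# ConfigPipeline and its methods are illustrative names, not a real API.

class ConfigPipeline:
    def __init__(self):
        self.propagating = True
        self.history = []   # (version, passed_validation) in publish order
        self.active = None  # version currently served to the fleet

    def publish(self, version, passed_validation):
        self.history.append((version, passed_validation))
        if self.propagating:
            self.active = version  # normal path: newest version wins

    def halt_and_pin_last_good(self):
        """Stop auto-propagation and roll back to the newest validated version."""
        self.propagating = False
        for version, ok in reversed(self.history):
            if ok:
                self.active = version
                return version
        raise LookupError("no known-good version to restore")
```

Once a good version is pinned, restarting the fleet's proxies against it clears any bad state they loaded, which matches the restart-and-verify work described above.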
4. What Went Wrong Beyond the Immediate Error
- Cloudflare’s Turnstile (its CAPTCHA / human-verification widget) failed to load.(The Cloudflare Blog)
- Workers KV (a key-value data store used by many developers) saw a spike in 5xx errors.(The Cloudflare Blog)
- Their Dashboard was partly inaccessible: many users couldn’t log in because Turnstile was down.(The Cloudflare Blog)
- Cloudflare Access (their authentication product) failed for many users: logins didn’t succeed, and connection attempts were blocked or returned errors.(The Cloudflare Blog)
- Email-security features were impacted: for a time, Cloudflare lost access to a reputation data source, reducing spam-detection fidelity.(The Cloudflare Blog)
- There was also increased latency across the CDN, because debugging and observability systems were using a lot of CPU to track the error conditions.(The Cloudflare Blog)
Who Was Impacted & How It Showed Up
The outage was global and affected a wide set of Cloudflare-backed services. Some of the reported and confirmed impacts:
- ChatGPT (OpenAI): Users couldn’t access the service; many got “internal server error” pages.(Reuters)
- X (Twitter): Widely disrupted, with many users reporting downtime or error messages.(The Washington Post)
- Canva, Grindr, and other SaaS / web platforms also reported issues.(Reuters)
- DownDetector (a site that aggregates outage reports) was itself affected; its partial disruption made it harder to gauge the scale of the outage.(Outlook Business)
- Some government sites and organizations were impacted — The Financial Times reported that MI5 and the UK’s Financial Conduct Authority saw Cloudflare-related problems.(Financial Times)
- Cloudflare’s own services: their dashboard became unreliable, login flows were disrupted, and its internal APIs were under strain.(The Cloudflare Blog)
Why This Outage Is Such a Big Deal
- Concentration Risk
- Cloudflare powers a huge chunk of the web — it’s not just a CDN, but also a security layer (DDoS protection, bot management, WAF).(The Guardian)
- When Cloudflare went down, it wasn’t just a handful of fringe websites that broke; entire services that depend on it for security and speed began returning server errors.
- Invisible Infrastructure
- Many users don’t realize how “invisible” Cloudflare is: you go to a website, but behind the scenes, requests are going through Cloudflare’s network. When that network falters, even the website owner may struggle to figure out what’s wrong immediately.
- This outage highlighted how dependency on third-party infrastructure can cascade: a provider-level issue affects not just that provider, but everyone who builds on it.
- Internal System Complexity
- The root cause was not an external attack — it was an internal misconfiguration / “feature file” bug. That shows how even mature infrastructure companies can run into serious trouble through configuration mistakes or database permission changes.
- The fact that the system regenerates this “feature file” every five minutes meant that the faulty config was continuously propagated, making the incident harder to stabilize.
- Recovery & Transparency
- Cloudflare deployed a fix relatively quickly (within a few hours) and was transparent about what happened.(The Cloudflare Blog)
- But there’s a long tail: recovering from a global outage means not just restoring traffic, but also fixing internal state, restarting services, and verifying that all systems are healthy again.
Lessons & Take-Home Messages
- Infrastructure redundancy matters: For mission-critical services, relying on just one CDN / security provider is a risk. Multi-CDN strategies or fallback mechanisms (even if costlier) can reduce exposure.
- Rigorous change management: When you’re managing configuration files that feed into critical systems (especially machine-learning-based decision modules), you need strong validation, schema checks, and upper-bound limits.
- Observability & kill switches: Cloudflare itself is already talking about hardening its system: more robust ingestion checks for feature files, global kill-switches for problematic features, and better fallback behavior in case of config failures.(The Cloudflare Blog)
- User trust & communication: When core internet infrastructure fails, users may panic. Transparent communication (status pages, incident blog) helps rebuild trust.
- Decentralization trade-offs: While centralized, large-scale infrastructure providers bring huge benefits (performance, scale, security), this incident is a reminder that they’re a single point of failure. Organizations building on top of them should think about resilience.
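The multi-CDN point can be made concrete with a small client-side fallback sketch: try providers in order and move on when one returns a server-side error. The origin URLs are placeholders and the fetcher is injected purely so the logic is self-contained; real deployments usually implement this at the DNS or load-balancer layer instead.

```python
# Hedged sketch of origin fallback across providers. OriginError and the
# injected fetcher are illustrative; origin URLs are placeholders.

class OriginError(Exception):
    """Raised by a fetcher when an origin returns a 5xx or is unreachable."""

def fetch_with_fallback(path, origins, fetcher):
    """Return the first successful response, trying each origin in order."""
    last_err = None
    for origin in origins:
        try:
            return fetcher(origin + path)
        except OriginError as err:
            last_err = err  # this provider is down; try the next one
    raise last_err  # every origin failed
```

DNS-based failover (low TTLs plus health checks) achieves the same effect without changing application code, at the cost of slower cutover when a provider goes down.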
What’s Next for Cloudflare & the Internet
- Cloudflare’s internal review: According to their post-mortem, they plan to “harden ingestion of Cloudflare-generated configuration files,” add kill-switches, and prevent similar cascading failures in the future.(The Cloudflare Blog)
- User behavior change: Service owners may reassess their dependency on Cloudflare, possibly exploring fallback / alternate CDNs.
- Industry lessons: Other CDN / cloud providers will almost certainly take notice; the event may spark broader conversation about how to build more resilient web infrastructure.
- Transparency push: Incidents like these fuel demand for stronger SLAs, better status reporting, and more open communication from infrastructure vendors.
Final Thoughts
This was not a minor outage. For several hours, a large chunk of the internet using Cloudflare for security and delivery began serving 5xx errors. That scenario underscores a vital truth: the internet may feel decentralized, but much of its backbone relies on a handful of big infrastructure providers. When one of those providers falters, the effects are felt widely and quickly.
Cloudflare’s mistake here was not malicious — it was a configuration / design failure. But its impact was real, and the recovery was hard-fought. The silver lining is that Cloudflare’s postmortem shows they’re taking the incident seriously, fixing core systems, and thinking deeply about how to prevent such a failure again.
The bigger lesson belongs to all of us in tech: we need to design for failure, and we need to treat critical infrastructure with the respect (and redundancy) it deserves.