Incident Report: Spotify Outage on April 16, 2025

Summary

On April 16, 2025, Spotify experienced a significant outage due to a bug triggered by a change in the order of Envoy Proxy filters. This incident led to simultaneous crashes across all Envoy instances, compounded by a misconfiguration that caused continuous cycling of servers under Kubernetes management. The article outlines the timeline of events, the technical reasons behind the outage, and the steps taken to resolve the issue, including increasing server capacity and fixing configuration mismatches. The incident highlights the importance of careful change management and robust monitoring in maintaining service availability.

Key Learnings

1. Understanding the impact of filter order changes in Envoy Proxy and how they can lead to system-wide failures.
2. The significance of aligning Envoy's heap size with Kubernetes memory limits to prevent service disruptions.
3. The role of client-side application retry logic in exacerbating load during outages and the need for effective load management strategies.
4. The importance of transparent communication and accountability in incident management to foster trust and continuous improvement.
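
The second learning, keeping the proxy's heap ceiling below the container's memory limit, can be checked mechanically before a configuration ships. The following is a minimal sketch, not Spotify's tooling: the function names, the 10% headroom default, and the idea of running this as a pre-deploy check are all assumptions made for illustration. It compares a configured maximum heap size in bytes against a Kubernetes-style memory quantity such as "2Gi".

```python
# Hypothetical pre-deploy check: all names and the headroom value are illustrative.

# Binary and decimal suffixes used by Kubernetes memory quantities.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory_quantity(quantity: str) -> int:
    """Convert a Kubernetes-style memory quantity such as '2Gi' to bytes."""
    quantity = quantity.strip()
    # Try two-character suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)  # plain byte count with no suffix


def heap_fits_limit(max_heap_bytes: int, memory_limit: str,
                    headroom: float = 0.9) -> bool:
    """Return True if the heap ceiling is at most `headroom` of the container limit."""
    return max_heap_bytes <= parse_memory_quantity(memory_limit) * headroom


if __name__ == "__main__":
    # A 3 GiB heap ceiling cannot fit inside a 2 GiB container limit.
    print(heap_fits_limit(3 * 2**30, "2Gi"))  # False
    # A 1 GiB heap ceiling leaves room under the same limit.
    print(heap_fits_limit(1 * 2**30, "2Gi"))  # True
```

The headroom factor exists because a process uses memory beyond its heap, so a heap ceiling exactly equal to the container limit would still risk the container being killed before any heap-based protection reacts.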

Who Should Read This

Senior DevOps Engineers managing high-availability systems using Envoy and Kubernetes

Test Your Knowledge

What are the potential risks associated with changing the order of filters in a proxy like Envoy?

How can misconfigurations in resource limits lead to cascading failures in a Kubernetes-managed environment?

What strategies can be employed to improve the rollout of configuration changes to minimize outage risks?

In what ways can client-side retry logic contribute to increased load during an outage, and how can this be managed?

What monitoring capabilities should be enhanced to detect issues like the one experienced during the outage?
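
For the question on client-side retries, one widely used mitigation is exponential backoff with jitter, so that clients do not all retry at the same moment while a service is recovering. The sketch below is a generic illustration and does not describe how Spotify's clients behave; `backoff_delay`, `call_with_retries`, and the default values are hypothetical names and parameters chosen for the example.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(operation, max_attempts: int = 5, sleep=time.sleep):
    """Run operation(); on exception, wait a jittered delay and retry.

    The exception from the final attempt is re-raised to the caller.
    `sleep` is injectable so the function can be exercised without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: give up instead of adding load
            sleep(backoff_delay(attempt))
```

Capping both the delay and the number of attempts bounds the extra load each client can add during an outage, and the random jitter spreads retries over time rather than concentrating them in synchronized waves.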

Topics

Envoy, Kubernetes, Incident Management, Load Shedding, High Availability

Read Full Article at Spotify
