Incident Report: Spotify Outage on April 16, 2025
Summary
On April 16, 2025, Spotify experienced a significant outage due to a bug triggered by a change in the order of Envoy Proxy filters. This incident led to simultaneous crashes across all Envoy instances, compounded by a misconfiguration that caused continuous cycling of servers under Kubernetes management. The article outlines the timeline of events, the technical reasons behind the outage, and the steps taken to resolve the issue, including increasing server capacity and fixing configuration mismatches. The incident highlights the importance of careful change management and robust monitoring in maintaining service availability.
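The configuration mismatch mentioned above can be caught before deployment with a simple consistency check. The sketch below assumes the mismatch is between a proxy's configured maximum heap size and the memory limit of its Kubernetes container. The function names, the 10% headroom, and the simplified quantity parser are illustrative assumptions, not taken from Spotify's tooling.

```python
def parse_k8s_memory(quantity: str) -> int:
    """Convert a Kubernetes memory quantity such as '512Mi' or '2G' to bytes.

    Simplified: handles only the common binary and decimal suffixes.
    """
    binary = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
    decimal = {"k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4}
    # Check two-letter binary suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix, factor in binary.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-2]) * factor)
    for suffix, factor in decimal.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-1]) * factor)
    return int(quantity)  # plain byte count


def heap_fits_limit(max_heap_bytes: int, memory_limit: str,
                    headroom: float = 0.9) -> bool:
    """Return True when the heap cap stays under the container memory limit.

    `headroom` (hypothetical default of 0.9) reserves part of the limit for
    memory the process uses outside the heap.
    """
    return max_heap_bytes <= parse_k8s_memory(memory_limit) * headroom
```

For example, `heap_fits_limit(2 * 1024**3, "1Gi")` is false: a 2 GiB heap cap can never be reached inside a 1 GiB container, so the container would be killed under load before the proxy's own memory protection could act.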
Key Learnings
1. Understanding the impact of filter order changes in Envoy Proxy and how they can lead to system-wide failures.
2. The significance of aligning Envoy's heap size with Kubernetes memory limits to prevent service disruptions.
3. The role of client-side application retry logic in exacerbating load during outages and the need for effective load management strategies.
4. The importance of transparent communication and accountability in incident management to foster trust and continuous improvement.
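The point about client retry logic can be illustrated with a common mitigation: capped exponential backoff with random jitter, so that clients do not all retry at the same instant after a shared failure. This is a generic sketch of that technique, not a description of how Spotify's clients behave. The names `backoff_delay` and `call_with_retries` and the default values are hypothetical.

```python
import random
import time


def backoff_delay(attempt: int, base: float, cap: float,
                  rng: random.Random) -> float:
    """Pick a random delay in [0, min(cap, base * 2**attempt)] ("full jitter")."""
    ceiling = min(cap, base * 2 ** attempt)
    return rng.uniform(0.0, ceiling)


def call_with_retries(operation, max_attempts: int = 4, base: float = 0.5,
                      cap: float = 30.0, sleep=time.sleep, rng=None):
    """Call `operation`, retrying failures with jittered exponential backoff.

    Re-raises the last error once `max_attempts` calls have failed.
    `sleep` and `rng` are injectable so the behaviour can be tested.
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            sleep(backoff_delay(attempt, base, cap, rng))
```

The jitter spreads retries over time, and the attempt limit bounds how much extra load a single client can add while the backend is recovering.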
Who Should Read This
Senior DevOps Engineers managing high-availability systems using Envoy and Kubernetes
Test Your Knowledge
What are the potential risks associated with changing the order of filters in a proxy like Envoy?
How can misconfigurations in resource limits lead to cascading failures in a Kubernetes-managed environment?
What strategies can be employed to improve the rollout of configuration changes to minimize outage risks?
In what ways can client-side retry logic contribute to increased load during an outage, and how can this be managed?
What monitoring capabilities should be enhanced to detect issues like the one experienced during the outage?