Spotify

Incident Report: Spotify Outage on April 16, 2025

Summary

On April 16, 2025, Spotify experienced a significant outage caused by a bug that was triggered when the order of Envoy Proxy filters was changed. The bug crashed all Envoy instances at the same time. The outage was compounded by a misconfiguration, a mismatch between Envoy's heap size and the Kubernetes memory limit, which caused Kubernetes to cycle the servers continuously. The article covers the timeline of events, the technical causes of the outage, and the steps taken to resolve it, including increasing server capacity and fixing configuration mismatches. The incident shows why careful change management and robust monitoring matter for service availability.
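
One way to picture this kind of configuration mismatch is a proxy heap cap that does not fit under the container's memory limit. The sketch below is a minimal pre-deploy check in Python. The function name `heap_fits_limit`, the 0.9 headroom factor, and the example sizes are assumptions for illustration only and do not come from Spotify's report.

```python
GIB = 1024 ** 3


def heap_fits_limit(max_heap_bytes: int, memory_limit_bytes: int,
                    headroom: float = 0.9) -> bool:
    """Return True if the heap cap leaves headroom under the memory limit.

    headroom is the fraction of the container limit the heap may use;
    0.9 is an arbitrary illustrative default, not a recommended value.
    """
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    return max_heap_bytes <= memory_limit_bytes * headroom


if __name__ == "__main__":
    # Hypothetical values: a 4 GiB heap cap against a 3 GiB container limit.
    print(heap_fits_limit(4 * GIB, 3 * GIB))
    # Hypothetical values: a 2 GiB heap cap against a 3 GiB container limit.
    print(heap_fits_limit(2 * GIB, 3 * GIB))
```

A check like this could run in CI against rendered deployment configuration, so a heap cap above the container limit is caught before rollout rather than by the orchestrator killing the process under load.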

Key Learnings

  • Understanding the impact of filter order changes in Envoy Proxy and how they can lead to system-wide failures.
  • The significance of aligning Envoy's heap size with Kubernetes memory limits to prevent service disruptions.
  • The role of client-side application retry logic in exacerbating load during outages and the need for effective load management strategies.
  • The importance of transparent communication and accountability in incident management to foster trust and continuous improvement.
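
The point about client-side retries can be illustrated with a common mitigation: exponential backoff with jitter, which spreads retries out so clients do not all hit recovering servers at once. This is a minimal Python sketch of that general technique, not the retry logic used by Spotify's clients. The function name `backoff_delays` and the `base` and `cap` defaults are hypothetical.

```python
import random


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    the "full jitter" variant of exponential backoff.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # The upper bound doubles each attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Print the delays a single client would wait before each of 5 retries.
    for i, delay in enumerate(backoff_delays(5)):
        print(f"retry {i + 1}: wait {delay:.2f}s")
```

Because every client draws its own random delay, retries from many clients are spread over time instead of arriving in synchronized waves. A retry budget or server-side load shedding would be complementary measures.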

Who Should Read This

Senior DevOps Engineers managing high-availability systems using Envoy and Kubernetes

Test Your Knowledge

  • What are the potential risks associated with changing the order of filters in a proxy like Envoy?
  • How can misconfigurations in resource limits lead to cascading failures in a Kubernetes-managed environment?
  • What strategies can be employed to improve the rollout of configuration changes to minimize outage risks?
  • In what ways can client-side retry logic contribute to increased load during an outage, and how can this be managed?
  • What monitoring capabilities should be enhanced to detect issues like the one experienced during the outage?
