What Brought Spotify Down on April 16, 2025?
How a Small Config Change Crashed Spotify’s Edge Proxies Worldwide – and What They Learned
On April 16, Spotify faced a major global outage that left most users unable to stream their favorite music for a few hours. So, what happened behind the scenes?
The Heart of the Issue: Envoy Proxy Chaos
Spotify uses Envoy Proxy at the edge of their network to route user traffic smartly across their global infrastructure. To make Envoy even more powerful, Spotify adds custom filters—like one for rate limiting. But one small change in the filter order turned out to have massive consequences.
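To see why filter order is so sensitive, here is a toy model of a filter chain in Python. It is purely illustrative: the filter names, the request shape, and the "x-client-key" header are invented for this post, not Envoy's real API or Spotify's actual configuration. The same two filters behave very differently depending on which one runs first.

```python
# Toy model of an HTTP filter chain: each filter sees the request in order.
# Illustrative only -- the filters and request shape here are invented for
# this post, not Envoy's real API or Spotify's actual configuration.

from dataclasses import dataclass, field

@dataclass
class Request:
    client_id: str
    headers: dict = field(default_factory=dict)

class ClientKeyFilter:
    """Derives the per-client key header that later filters rely on."""
    def on_request(self, req: Request) -> bool:
        req.headers["x-client-key"] = req.client_id
        return True

class RateLimitFilter:
    """Allows at most `limit` requests per client, keyed on a header."""
    def __init__(self, limit: int):
        self.limit = limit
        self.counts = {}

    def on_request(self, req: Request) -> bool:
        # Assumes an earlier filter has already set the key header.
        key = req.headers.get("x-client-key", "anonymous")
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.limit

def run_chain(filters, req: Request) -> bool:
    # Filters run strictly in list order; the chain stops on the first rejection.
    return all(f.on_request(req) for f in filters)

# Intended order: derive the client key first, then rate-limit per client.
safe_order = [ClientKeyFilter(), RateLimitFilter(limit=100)]

# "Low-risk" reorder: the rate limiter now runs before the key exists,
# so every client collapses into a single shared "anonymous" bucket.
risky_order = [RateLimitFilter(limit=100), ClientKeyFilter()]
```

In the toy version the damage is just a shared rate-limit bucket; in the real incident, the new order exposed a latent bug in one of the custom filters that crashed the process outright.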
The Chain Reaction
A “low-risk” reordering of Envoy filters triggered a hidden bug in one of those custom filters, crashing every Envoy instance globally—except in Asia Pacific, where traffic was low due to the time of day.
When the crashed instances restarted, client-side retry logic kicked in and flooded the perimeter with traffic. Worse, the memory limit was misconfigured: Envoy's heap ceiling was set higher than the memory Kubernetes allowed the container, so under the retry flood each instance exhausted its memory, got killed, and restarted, in an endless crash loop.
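Envoy's memory is typically bounded through its overload manager, while Kubernetes enforces a hard memory limit on the container; if the first number is larger than the second, the kernel kills the pod before the proxy ever gets a chance to shed load. A minimal sanity check for that mismatch might look like the sketch below; the numbers and names are assumptions for illustration, not Spotify's actual values.

```python
# Sanity check: the proxy's configured heap ceiling must fit inside the
# container's memory limit with some headroom, otherwise Kubernetes kills
# the pod before the proxy can shed load -- a crash loop under heavy retry
# traffic. All values below are illustrative, not Spotify's.

MIB = 1024 * 1024

def heap_fits_container(max_heap_bytes: int, container_limit_bytes: int,
                        headroom_fraction: float = 0.2) -> bool:
    """True if the heap ceiling leaves `headroom_fraction` of the container
    limit free for stacks, buffers, and the runtime itself."""
    return max_heap_bytes <= container_limit_bytes * (1 - headroom_fraction)

# Hypothetical misconfiguration resembling the one described in the post:
envoy_max_heap = 8 * 1024 * MIB      # 8 GiB heap ceiling in the proxy config
k8s_memory_limit = 6 * 1024 * MIB    # 6 GiB memory limit on the container

if not heap_fits_container(envoy_max_heap, k8s_memory_limit):
    raise SystemExit("proxy heap ceiling exceeds the container memory limit")
```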
How They Fixed It
Spotify increased server capacity to reduce the memory pressure, which helped stabilize the perimeter and stop the crashing cycle.
Timeline Snapshot:
12:18 UTC: Filter change → Crash
12:20 UTC: Global traffic drops
14:20 UTC: Europe recovers
15:10 UTC: US recovers
15:40 UTC: Back to normal
What’s Next for Spotify?
Bug fix in the crashing filter
Corrected memory configuration, aligning Envoy's heap limit with the Kubernetes container limit
Smarter rollout process for config changes (see the sketch after this list)
Better monitoring to catch issues earlier
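Spotify hasn't shared the details of its new rollout process, so treat the following as a generic sketch of the idea rather than their implementation: push a config change to a small slice of the fleet first, watch a health signal while it soaks, and abort automatically before a bad change reaches every region. The stage sizes, threshold, and helper functions are all assumptions.

```python
import time

# Sketch of a staged config rollout: each stage covers a larger share of the
# fleet, and the rollout aborts (and rolls back) if health degrades.
# Stage sizes, the health signal, and these helpers are illustrative only.

STAGES = [0.01, 0.10, 0.50, 1.00]   # fraction of instances per stage
ERROR_RATE_THRESHOLD = 0.01         # abort if more than 1% of requests fail
SOAK_SECONDS = 600                  # let each stage run before judging it

def rollout(apply_to_fraction, rollback, error_rate) -> bool:
    """apply_to_fraction(f) pushes the config to a fraction f of instances,
    rollback() restores the previous config, and error_rate() returns the
    current fleet-wide error rate. All three come from the deployment system."""
    for fraction in STAGES:
        apply_to_fraction(fraction)
        time.sleep(SOAK_SECONDS)
        if error_rate() > ERROR_RATE_THRESHOLD:
            rollback()
            return False
    return True
```

The exact numbers matter less than the shape: a change like the filter reordering would likely have hit a small fraction of instances and been rolled back, instead of crashing every region at once.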
Takeaway:
Even small changes can trigger large-scale outages in complex distributed systems. Transparency like this from Spotify helps the whole tech community learn and improve.