Seamless Istio Upgrades at Scale
Summary
The article outlines Airbnb's approach to upgrading Istio across a vast infrastructure that includes tens of thousands of pods and multiple Kubernetes clusters. It emphasizes the importance of maintaining high availability during upgrades and describes the architectural setup that allows for seamless transitions between Istio versions. The upgrade process is built around a canary release model, enabling gradual rollouts and minimizing risk through controlled deployments. The use of an internal mutation framework, Krispr, allows for decoupling workload deployments from infrastructure upgrades, ensuring that teams can manage their workloads independently while still adhering to upgrade protocols.
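A gradual rollout of this kind needs some rule for deciding which workloads move to the new Istio version at each step. The sketch below shows one common way to do that, deterministic hash bucketing, and is not taken from the article: the function names, the percentage parameter, and the revision strings are all hypothetical, and Airbnb's actual selection logic may differ.

```python
import hashlib


def rollout_bucket(workload: str) -> int:
    """Map a workload name to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(workload.encode("utf-8")).digest()
    # The first 8 bytes of the digest give a well-spread integer.
    return int.from_bytes(digest[:8], "big") % 100


def pick_revision(workload: str, canary_percent: int,
                  stable: str, canary: str) -> str:
    """Return the Istio revision a workload should use at this rollout step.

    `stable` and `canary` are hypothetical revision names, for example
    "1-24" and "1-25". A workload whose bucket falls below `canary_percent`
    gets the canary revision; everything else stays on the stable one.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return canary if rollout_bucket(workload) < canary_percent else stable
```

Because the bucket depends only on the workload name, raising the percentage only ever adds workloads to the canary set, so a workload that has already moved does not flip back as the rollout widens.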
Key Learnings
1. Airbnb employs a canary upgrade model for Istio, allowing multiple versions to run simultaneously and ensuring seamless transitions.
2. The architecture includes a management cluster for configuration and multiple workload clusters, facilitating independent upgrades without downtime.
3. Krispr, an internal mutation framework, automates the injection of Istio revision labels into workloads, decoupling infrastructure upgrades from workload deployments.
4. Gradual rollouts and the ability to control workload upgrades independently help mitigate risks associated with large-scale upgrades.
5. Monitoring and managing workloads through a central controller ensures that upgrades are performed safely and within defined timeframes.
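The label injection described in the list above can be pictured as a small mutation applied to a pod manifest before it is admitted. The sketch below is only an illustration under stated assumptions: Krispr is internal to Airbnb and its API is not public, so the function name and the revision value are hypothetical. The one real identifier is `istio.io/rev`, the label Istio uses for revision-based sidecar injection.

```python
import copy

# Label Istio uses to select a control-plane revision for sidecar injection.
REVISION_LABEL = "istio.io/rev"


def inject_revision_label(pod: dict, revision: str) -> dict:
    """Return a copy of a pod manifest with the Istio revision label set.

    Hypothetical helper: `pod` is a plain dict shaped like a Kubernetes
    pod manifest, and `revision` is an illustrative revision name such
    as "1-25". The input manifest is left unmodified.
    """
    mutated = copy.deepcopy(pod)
    metadata = mutated.setdefault("metadata", {})
    # Handle both a missing "labels" key and an explicit null value.
    labels = metadata.get("labels") or {}
    labels[REVISION_LABEL] = revision
    metadata["labels"] = labels
    return mutated
```

Keeping the mutation outside the workload's own manifest is what lets the infrastructure team change the revision without asking each service owner to edit and redeploy their configuration.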
Who Should Read This
Senior Site Reliability Engineers managing large-scale Kubernetes environments with a focus on service mesh upgrades
Test Your Knowledge
What are the trade-offs of using a canary release model for Istio upgrades in a large-scale environment?
How does Krispr facilitate the decoupling of workload deployments from infrastructure upgrades, and what are the implications of this design choice?
What failure scenarios could arise during the Istio upgrade process, and how are they mitigated?
Why is maintaining high availability critical during the upgrade of foundational infrastructure like Istio?
How does the architecture of Airbnb's Istio deployment support independent upgrades across various teams and workloads?