Achieving High Availability with distributed database on Kubernetes at Airbnb
Summary
The article describes how Airbnb achieves high availability for distributed databases deployed on Kubernetes. It outlines the challenges of running stateful services on Kubernetes, focusing on node replacement and upgrades. By categorizing node replacement events and implementing a custom Kubernetes operator, Airbnb preserves data consistency and availability during infrastructure changes. Running multiple Kubernetes clusters across different AWS availability zones adds fault tolerance and limits the impact of failures. The article also covers relying on AWS EBS for reliability while managing its latency, and shows how open-source databases can run well in cloud environments.
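To make the idea of categorizing node replacement events concrete, here is a minimal sketch of a gate that an operator might consult before replacing a node. The two categories (`PLANNED`, `UNPLANNED`), the `ClusterState` fields, and the `max_unavailable` limit are all hypothetical names chosen for illustration; the summary does not say which categories or rules Airbnb's operator actually uses.

```python
from dataclasses import dataclass
from enum import Enum


class ReplacementKind(Enum):
    # Hypothetical categories, not taken from the article.
    PLANNED = "planned"      # e.g. an upgrade that can be scheduled
    UNPLANNED = "unplanned"  # e.g. a node that has already failed


@dataclass
class ClusterState:
    replicas: int      # total database nodes in the cluster
    unavailable: int   # nodes currently down or being replaced


def may_replace(kind: ReplacementKind, state: ClusterState,
                max_unavailable: int = 1) -> bool:
    """Return True if a node replacement may start now."""
    if kind is ReplacementKind.UNPLANNED:
        # The node is already gone, so recovery should not wait.
        return True
    # Planned work waits until the cluster has spare capacity.
    return state.unavailable < max_unavailable
```

With these assumptions, a planned upgrade on a healthy three-node cluster proceeds, while the same upgrade is deferred if another node is already unavailable.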
Key Learnings
1. Implementing a custom Kubernetes operator can significantly improve the management of stateful services like databases.
2. Categorizing node replacement events allows for better coordination and minimizes service disruption during upgrades and failures.
3. Deploying databases across multiple Kubernetes clusters in different availability zones enhances fault tolerance and limits the blast radius of issues.
4. Leveraging AWS EBS for rapid reattachment and durability can optimize database performance and availability.
5. Utilizing stale reads and read timeouts can mitigate latency spikes and improve overall query performance.
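The last point, combining read timeouts with stale reads, can be sketched as follows. This is an illustration under stated assumptions, not the database client Airbnb uses: `read_with_fallback`, `strong_read`, `stale_read`, and the 50 ms timeout are hypothetical names and values, and the two read functions are simulated stand-ins.

```python
import asyncio


async def read_with_fallback(key, strong_read, stale_read, timeout_s=0.05):
    """Try a consistent read first; if it times out, serve a stale read."""
    try:
        value = await asyncio.wait_for(strong_read(key), timeout=timeout_s)
        return value, "strong"
    except asyncio.TimeoutError:
        # The slow read is cancelled by wait_for; fall back to a replica
        # that may return slightly older data.
        return await stale_read(key), "stale"


async def _demo():
    async def slow_leader(key):
        await asyncio.sleep(0.5)  # simulates a storage latency spike
        return f"{key}=fresh"

    async def fast_follower(key):
        return f"{key}=slightly-old"

    return await read_with_fallback("listing:42", slow_leader, fast_follower)


result = asyncio.run(_demo())
```

The trade-off is explicit in the return value: the caller learns whether the data came from the consistent path or the stale path and can decide whether that is acceptable for the query.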
Who Should Read This
Senior Site Reliability Engineers managing high-availability distributed systems on Kubernetes
Test Your Knowledge
What are the trade-offs of using Kubernetes for managing stateful services compared to traditional methods?
How does the categorization of node replacement events affect the overall availability of the database?
What design decisions led to the choice of deploying across multiple Kubernetes clusters in different availability zones?
In what scenarios might the custom Kubernetes operator fail, and how can those failures be mitigated?
Why is it important to implement read timeouts and stale reads in the context of AWS EBS latency?