Airbnb • 7 min read • July 28, 2025

Achieving High Availability with distributed database on Kubernetes at Airbnb

Summary

The article describes how Airbnb achieves high availability for distributed databases deployed on Kubernetes. It outlines the challenges of running stateful services on Kubernetes, focusing on node replacement and upgrades. Airbnb categorizes node replacement events and uses a custom Kubernetes operator to coordinate them, so that data stays consistent and available during infrastructure changes. Each database is deployed across multiple Kubernetes clusters in different AWS availability zones, which improves fault tolerance and limits the impact of any single failure. The article also covers AWS EBS: its rapid reattachment and durability support reliability, while read timeouts and stale reads are used to absorb its latency spikes. Together these show how an open-source database can run reliably in a cloud environment.
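As a rough illustration of what categorizing node replacement events could look like inside an operator, here is a minimal Python sketch. The summary does not list the actual categories or the operator's rules, so the category names, the `ClusterState` fields, and the `may_proceed` gating logic are all assumptions made for illustration, not Airbnb's implementation.

```python
from dataclasses import dataclass
from enum import Enum


class ReplacementKind(Enum):
    # Hypothetical categories; the summary does not name the real ones.
    DATABASE_INITIATED = "database_initiated"  # e.g. config change, version upgrade
    INFRA_PLANNED = "infra_planned"            # e.g. scheduled node upgrade
    INFRA_UNPLANNED = "infra_unplanned"        # e.g. hardware failure


@dataclass
class ClusterState:
    replicas_healthy: int        # replicas currently serving traffic
    replacements_in_flight: int  # node replacements already under way


def may_proceed(kind: ReplacementKind, state: ClusterState, quorum: int) -> bool:
    """Return True if a node replacement of this kind may start now."""
    if kind is ReplacementKind.INFRA_UNPLANNED:
        # The node is already lost, so there is nothing left to gate.
        return True
    # Planned events are serialized: only one replacement at a time.
    if state.replacements_in_flight > 0:
        return False
    # Taking one healthy replica out must still leave a quorum.
    return state.replicas_healthy - 1 >= quorum
```

The point of the split is that planned events can be delayed until the database is healthy enough to tolerate them, while unplanned events can only be reacted to.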

Key Learnings

1. Implementing a custom Kubernetes operator can significantly improve the management of stateful services like databases.
2. Categorizing node replacement events allows for better coordination and minimizes service disruption during upgrades and failures.
3. Deploying databases across multiple Kubernetes clusters in different availability zones enhances fault tolerance and limits the blast radius of issues.
4. Leveraging AWS EBS for rapid reattachment and durability can optimize database performance and availability.
5. Utilizing stale reads and read timeouts can mitigate latency spikes and improve overall query performance.
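The last point can be sketched on the client side as a read that gives up after a timeout and falls back to a possibly stale copy. This is a minimal Python sketch under stated assumptions: `strong_read` and `stale_read` are hypothetical callables standing in for a consistent read and a stale read against another replica, and the thread-pool timeout is just one way to bound the wait. The summary does not say how Airbnb implements this.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")

# Shared worker pool for the primary (consistent) read attempts.
_pool = ThreadPoolExecutor(max_workers=4)


def read_with_fallback(
    strong_read: Callable[[], T],  # hypothetical: consistent read, may be slow
    stale_read: Callable[[], T],   # hypothetical: possibly stale read from another replica
    timeout_s: float,
) -> T:
    """Try the consistent read first; on timeout, serve a stale read instead."""
    future = _pool.submit(strong_read)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Best effort only: a call that is already running is not interrupted.
        future.cancel()
        return stale_read()
```

The trade-off is explicit: the caller accepts slightly older data in exchange for a bounded response time when the storage behind the primary read is slow.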

Who Should Read This

Senior Site Reliability Engineers managing high-availability distributed systems on Kubernetes

Test Your Knowledge

What are the trade-offs of using Kubernetes for managing stateful services compared to traditional methods?

How does the categorization of node replacement events affect the overall availability of the database?

What design decisions led to the choice of deploying across multiple Kubernetes clusters in different availability zones?

In what scenarios might the custom Kubernetes operator fail, and how can those failures be mitigated?

Why is it important to implement read timeouts and stale reads in the context of AWS EBS latency?

Topics

High Availability, Node Replacement, Kubernetes, Distributed Database, Fault Tolerance


