GitHub
6 min read

How we rebuilt the search architecture for high availability in GitHub Enterprise Server

Summary

The article describes how GitHub rebuilt the search architecture in GitHub Enterprise Server to improve high availability (HA). The previous design relied on a clustered Elasticsearch setup; the new one runs independent single-node Elasticsearch clusters and replicates data between them with Cross Cluster Replication (CCR). This improves data replication and reduces the risk of ending up in a locked state during maintenance. The article also covers the difficulty of fitting the earlier Elasticsearch integration to a leader/follower pattern and details the new workflows built to support CCR, so that critical data remains accessible and durable.
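To make the CCR idea concrete, here is a minimal sketch of what asking a replica's Elasticsearch cluster to follow indices on the primary could look like. It only builds the requests for Elasticsearch's create-follower endpoint (`PUT /<follower_index>/_ccr/follow`) and does not send them. The index names, the remote cluster alias `primary`, and the helper functions are hypothetical illustrations, not GitHub's actual workflow, and the sketch assumes the remote cluster has already been registered on the follower.

```python
import json
from urllib.parse import quote


def build_follow_request(follower_index: str, leader_index: str, remote_cluster: str) -> dict:
    """Describe the HTTP request that makes `follower_index` replicate `leader_index`.

    `remote_cluster` is the alias under which the leader cluster is registered
    on the follower (hypothetical value in this sketch).
    """
    return {
        "method": "PUT",
        "path": f"/{quote(follower_index, safe='')}/_ccr/follow",
        "body": {"remote_cluster": remote_cluster, "leader_index": leader_index},
    }


def plan_follows(leader_indices, existing_followers, remote_cluster):
    """Return follow requests for every leader index the replica does not follow yet."""
    missing = sorted(set(leader_indices) - set(existing_followers))
    # Assumption: the follower index keeps the same name as its leader index.
    return [build_follow_request(name, name, remote_cluster) for name in missing]


if __name__ == "__main__":
    # Hypothetical index names, for illustration only.
    requests_to_send = plan_follows(["issues-1", "code-search-1"], ["issues-1"], "primary")
    for req in requests_to_send:
        print(req["method"], req["path"], json.dumps(req["body"]))
```

Because each follower index must be created explicitly (or through an auto-follow pattern), some workflow has to notice new leader indices and set up followers for them, which is the kind of index lifecycle management the summary refers to.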

Key Learnings

  1. Understanding the limitations of clustered Elasticsearch setups in high availability scenarios.
  2. The importance of Cross Cluster Replication (CCR) in maintaining data integrity and availability.
  3. How to implement workflows for managing Elasticsearch index lifecycles in a high availability context.
  4. The trade-offs between using a clustered architecture versus independent single-node clusters.
  5. The necessity of custom solutions for failover and index management in distributed systems.

Who Should Read This

Senior Site Reliability Engineers (SREs) implementing high availability architectures for enterprise applications, particularly those utilizing Elasticsearch.

Test Your Knowledge

What are the primary challenges associated with using clustered Elasticsearch in a high availability setup?

How does Cross Cluster Replication (CCR) improve data management in GitHub Enterprise Server?

What design decisions led to the transition from a clustered architecture to independent single-node Elasticsearch clusters?

In what scenarios might a leader/follower pattern fail, and how does the new architecture mitigate these risks?

What custom workflows are necessary to manage Elasticsearch index lifecycles effectively in a high availability environment?
