Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters
Summary
The article describes how Meta uses backend aggregation (BAG) in its Prometheus AI clusters to interconnect thousands of GPUs across multiple data centers. BAG is a centralized, Ethernet-based super-spine network layer that connects separate spine-layer fabrics, allowing them to be combined into very large, high-bandwidth AI clusters. The article details the architectural choices behind BAG, including modular hardware, advanced routing techniques, and strategies for resilience and performance optimization. It also covers the fabric technologies that BAG connects, Disaggregated Scheduled Fabric (DSF) and Non-Scheduled Fabric (NSF), and what these choices mean for future scalability and innovation in AI infrastructure.
Key Learnings
- Backend aggregation (BAG) is critical for interconnecting large-scale AI clusters, enabling high-capacity networking across multiple data centers.
- The choice of network topology (planar vs. spread) impacts failure domains and resilience, necessitating careful planning in deployment.
- Modular hardware and advanced routing protocols, such as eBGP with UCMP, enhance load balancing and failure handling in BAG implementations.
- Effective management of oversubscription ratios is essential for balancing performance and scalability in large AI deployments.
- BAG's architecture allows for optimized buffer utilization, which is crucial for maintaining performance in lossless congestion control scenarios.
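Two of the points above, oversubscription ratios and UCMP (unequal-cost multipath) load balancing, come down to simple arithmetic. The following is a minimal Python sketch of that arithmetic, not Meta's implementation. The function names, path names, and capacity figures are all hypothetical and chosen only for illustration.

```python
def oversubscription_ratio(downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio of capacity facing the fabric to capacity facing the aggregation layer.

    A value of 1.0 means non-blocking; larger values mean more contention
    for the uplinks when all downlinks are busy.
    """
    if uplink_gbps <= 0:
        raise ValueError("uplink capacity must be positive")
    return downlink_gbps / uplink_gbps


def ucmp_shares(path_capacity_gbps: dict[str, float]) -> dict[str, float]:
    """Split traffic across paths in proportion to each path's capacity.

    This is the basic idea behind UCMP: unlike ECMP, which gives every
    path an equal share, a path with less remaining capacity (for example
    after a link failure) is given proportionally less traffic.
    """
    total = sum(path_capacity_gbps.values())
    if total <= 0:
        raise ValueError("at least one path must have positive capacity")
    return {path: cap / total for path, cap in path_capacity_gbps.items()}


# Hypothetical numbers: 4.5 Tbps of fabric-facing capacity over 1 Tbps of uplinks.
ratio = oversubscription_ratio(downlink_gbps=4500, uplink_gbps=1000)

# Hypothetical paths: "bag-3" has lost half of its links, so it advertises
# half the capacity of its peers and receives half their share of traffic.
shares = ucmp_shares({"bag-1": 800, "bag-2": 800, "bag-3": 400})

print(f"oversubscription {ratio:.1f}:1")
for path, share in shares.items():
    print(f"{path}: {share:.0%} of traffic")
```

In a real deployment the weights would typically be derived from routing-protocol attributes rather than computed in application code; the sketch only shows why capacity-proportional weighting keeps a degraded path from being overloaded.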
Who Should Read This
Senior Network Engineers designing high-capacity interconnects for large-scale AI infrastructures
Test Your Knowledge
What are the trade-offs between planar and spread connection topologies in the context of BAG?
How does the choice of modular hardware impact the scalability and reliability of the BAG architecture?
What specific strategies are employed to mitigate risks associated with blackholing in the BAG network?
In what ways does the implementation of eBGP with UCMP improve load balancing and failure resilience?
How does the management of oversubscription ratios influence the performance of AI clusters connected via BAG?