Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters

Summary

The article describes how Meta uses backend aggregation (BAG) in its Prometheus AI clusters to interconnect thousands of GPUs across multiple data centers. BAG is a centralized, Ethernet-based super-spine network layer that connects the spine-layer fabrics of individual data centers, joining them into a single very high-bandwidth AI cluster. The article covers the architectural choices behind BAG, including modular hardware, eBGP routing with UCMP, and strategies for resilience and performance. It also examines the two fabric technologies that BAG interconnects, Disaggregated Scheduled Fabric (DSF) and Non-Scheduled Fabric (NSF), and what these choices mean for the future scalability of AI infrastructure.

Key Learnings

1. Backend aggregation (BAG) is critical for interconnecting large-scale AI clusters, enabling high-capacity networking across multiple data centers.
2. The choice of network topology (planar vs. spread) impacts failure domains and resilience, necessitating careful planning in deployment.
3. Modular hardware and advanced routing protocols, such as eBGP with UCMP, enhance load balancing and failure handling in BAG implementations.
4. Effective management of oversubscription ratios is essential for balancing performance and scalability in large AI deployments.
5. BAG's architecture allows for optimized buffer utilization, which is crucial for maintaining performance in lossless congestion control scenarios.
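
Two of the points above, oversubscription ratios and UCMP load balancing, come down to simple arithmetic. The following is a minimal sketch of that arithmetic, not Meta's implementation: the function names, the link names, and all capacity figures are hypothetical, and it assumes UCMP weights are set in proportion to each path's available capacity, which this summary does not confirm.

```python
def oversubscription_ratio(downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio of capacity facing the lower layer to capacity facing the upper layer.

    A ratio of 1.0 is non-blocking; higher values mean the uplinks can carry
    only a fraction of what the downlinks could offer at once.
    """
    if uplink_gbps <= 0:
        raise ValueError("uplink capacity must be positive")
    return downlink_gbps / uplink_gbps


def ucmp_shares(path_capacity_gbps: dict[str, float]) -> dict[str, float]:
    """Split traffic across paths in proportion to capacity (assumed weighting).

    Unlike ECMP, which gives every path an equal share, UCMP lets a path with
    more capacity carry proportionally more traffic.
    """
    total = sum(path_capacity_gbps.values())
    if total <= 0:
        raise ValueError("at least one path must have positive capacity")
    return {path: cap / total for path, cap in path_capacity_gbps.items()}


if __name__ == "__main__":
    # Hypothetical figures, chosen only to make the arithmetic easy to follow.
    print(oversubscription_ratio(downlink_gbps=4800, uplink_gbps=1200))

    paths = {"bag-1": 800.0, "bag-2": 400.0, "bag-3": 400.0}
    print(ucmp_shares(paths))

    # If one path fails, recomputing the shares over the survivors shifts
    # its traffic to them in proportion to their capacity.
    survivors = {p: c for p, c in paths.items() if p != "bag-3"}
    print(ucmp_shares(survivors))
```

In this sketch a path failure is handled by recomputing the shares over the remaining paths, so traffic moves to the survivors in proportion to their capacity rather than being split evenly.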

Who Should Read This

Senior Network Engineers designing high-capacity interconnects for large-scale AI infrastructures

Test Your Knowledge

What are the trade-offs between planar and spread connection topologies in the context of BAG?

How does the choice of modular hardware impact the scalability and reliability of the BAG architecture?

What specific strategies are employed to mitigate risks associated with blackholing in the BAG network?

In what ways does the implementation of eBGP with UCMP improve load balancing and failure resilience?

How does the management of oversubscription ratios influence the performance of AI clusters connected via BAG?

Topics

High Availability, Load Shedding, Resilience Engineering, Service Discovery, Failover

Read Full Article at Meta (Facebook)
