Debugging the One-in-a-Million Failure: Migrating Pinterest’s Search Infrastructure to Kubernetes

Summary

The article chronicles the migration of Pinterest’s search infrastructure, Manas, to Kubernetes, highlighting a significant performance issue where one in a million search requests experienced extreme latency. The team undertook a systematic investigation, employing both clearbox and blackbox debugging techniques to identify the root cause of the latency spikes, which was traced back to cAdvisor's intrusive memory metrics collection. The resolution involved disabling the problematic metric, thereby stabilizing the search service's performance. This case study illustrates the complexities of migrating large-scale systems and the importance of resource isolation and debugging strategies in distributed environments.
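To make the "one in a million" framing concrete, here is a minimal sketch of why such a failure is easy to miss on a percentile dashboard. The latency values are synthetic and chosen only for illustration; they are not measurements from the article, and the helper `nearest_rank_percentile` is a hypothetical name.

```python
import math

def nearest_rank_percentile(sorted_values, pct):
    """Return the nearest-rank percentile of an ascending-sorted list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]

# Synthetic data (an assumption for illustration): 999,999 requests at 10 ms
# and a single request at 5,000 ms, i.e. one slow request in a million.
latencies_ms = [10.0] * 999_999 + [5000.0]
latencies_ms.sort()

p99 = nearest_rank_percentile(latencies_ms, 99.0)
p9999 = nearest_rank_percentile(latencies_ms, 99.99)
worst = latencies_ms[-1]

print(f"p99={p99} ms, p99.99={p9999} ms, max={worst} ms")
```

In this synthetic case both p99 and p99.99 report the fast latency, and only the maximum (or a percentile beyond 99.9999) exposes the outlier, which is one reason failures at this rate tend to need targeted investigation rather than standard dashboards.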

Key Learnings

1. The migration of complex systems like Manas to Kubernetes requires careful consideration of resource management and monitoring tools.
2. Identifying performance bottlenecks in distributed systems often necessitates a combination of profiling and isolating variables to narrow down potential causes.
3. Intrusive monitoring metrics can significantly impact performance, necessitating a balance between observability and system efficiency.
4. Effective debugging in distributed systems can involve both clearbox and blackbox approaches, and sometimes requires unconventional methods like process suspension to identify issues.
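As a rough illustration of the process-suspension idea in the last point, the sketch below pauses a process with `SIGSTOP` and resumes it with `SIGCONT`, so one could check whether a symptom disappears while a suspect is frozen. This summary does not say how the Pinterest team performed suspension, so the use of POSIX signals, the helper name `suspend_for`, and the stand-in child process are all assumptions for illustration. It requires a POSIX system such as Linux.

```python
import os
import signal
import subprocess
import sys
import time

def suspend_for(pid, seconds):
    """Freeze a process with SIGSTOP, wait, then resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    try:
        # While the suspect is frozen, one would watch the latency metric here.
        time.sleep(seconds)
    finally:
        # Always resume, even if the observation step raises.
        os.kill(pid, signal.SIGCONT)

# Hypothetical stand-in for a suspect process (for example, a monitoring agent).
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

suspend_for(child.pid, 0.2)
still_running = child.poll() is None  # suspension pauses the process, it does not kill it

# Clean up the stand-in process.
child.terminate()
child.wait()
```

The appeal of this blackbox approach is that it needs no knowledge of what the suspect does internally: if the latency spikes stop while the process is frozen and return once it resumes, that process becomes the focus of further investigation.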

Who Should Read This

Senior Site Reliability Engineers (SREs) managing Kubernetes clusters for high-traffic applications facing performance challenges.

Test Your Knowledge

What are the trade-offs involved in using cAdvisor for monitoring in a memory-intensive application like Manas?

How does the architecture of Manas contribute to its performance challenges during the migration to Kubernetes?

What specific Linux kernel features were identified as contributing to the latency issues, and how did they interact with Kubernetes?

In what ways can resource isolation strategies like CPU shielding fail in a Kubernetes environment, and what alternatives could be considered?

Why is it important to validate that issues are not caused by the underlying AMI when troubleshooting performance in Kubernetes?

Topics

High Availability, Latency, Load Shedding, Service Discovery, Memory Management

Read Full Article at Pinterest
