Pinterest
10 min read

Debugging the One-in-a-Million Failure: Migrating Pinterest’s Search Infrastructure to Kubernetes

Summary

The article chronicles the migration of Pinterest’s search infrastructure, Manas, to Kubernetes, and the performance problem that surfaced along the way: one in a million search requests experienced extreme latency. The team investigated systematically, combining clearbox and blackbox debugging techniques, and traced the latency spikes to cAdvisor's intrusive memory metrics collection. Disabling the problematic metric stabilized the search service's performance. The case study illustrates the complexity of migrating large-scale systems and the importance of resource isolation and methodical debugging in distributed environments.

Key Learnings

  • The migration of complex systems like Manas to Kubernetes requires careful consideration of resource management and monitoring tools.
  • Identifying performance bottlenecks in distributed systems often necessitates a combination of profiling and isolating variables to narrow down potential causes.
  • Intrusive monitoring metrics can significantly impact performance, necessitating a balance between observability and system efficiency.
  • Effective debugging in distributed systems can involve both clearbox and blackbox approaches, and sometimes requires unconventional methods like process suspension to identify issues.
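
The process-suspension idea in the last point can be sketched roughly as follows. This is a minimal illustration, not the method the article describes in detail: it assumes a POSIX host where a suspect process can be paused with SIGSTOP and resumed with SIGCONT, and the names `suspended`, `compare_tail`, and `probe` are hypothetical. The comparison is only suggestive; a one-in-a-million event needs far more samples than a quick run provides.

```python
import os
import signal
import time
from contextlib import contextmanager


@contextmanager
def suspended(pid):
    """Pause process `pid` with SIGSTOP for the duration of the block, then resume it."""
    os.kill(pid, signal.SIGSTOP)
    try:
        yield
    finally:
        # Always resume, even if the probe raises, so the suspect is never left frozen.
        os.kill(pid, signal.SIGCONT)


def compare_tail(probe, suspect_pid, samples=1000):
    """Return (baseline_worst, suspended_worst): the worst probe latency in seconds
    with the suspect process running, and with it suspended."""

    def worst_latency():
        worst = 0.0
        for _ in range(samples):
            start = time.perf_counter()
            probe()  # hypothetical callable that issues one request
            worst = max(worst, time.perf_counter() - start)
        return worst

    baseline = worst_latency()
    with suspended(suspect_pid):
        quiet = worst_latency()
    return baseline, quiet
```

If the worst-case latency drops sharply while a given process is suspended, that process becomes a stronger candidate for the source of interference; if nothing changes, it can be ruled out and the next suspect tried.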

Who Should Read This

Senior Site Reliability Engineers (SREs) who manage Kubernetes clusters for high-traffic applications and are facing performance challenges.

Test Your Knowledge

  • What are the trade-offs involved in using cAdvisor for monitoring in a memory-intensive application like Manas?
  • How does the architecture of Manas contribute to its performance challenges during the migration to Kubernetes?
  • What specific Linux kernel features were identified as contributing to the latency issues, and how did they interact with Kubernetes?
  • In what ways can resource isolation strategies like CPU shielding fail in a Kubernetes environment, and what alternatives could be considered?
  • Why is it important to validate that issues are not caused by the underlying AMI when troubleshooting performance in Kubernetes?
