Debugging the One-in-a-Million Failure: Migrating Pinterest’s Search Infrastructure to Kubernetes
Summary
The article chronicles the migration of Pinterest’s search infrastructure, Manas, to Kubernetes, highlighting a significant performance issue where one in a million search requests experienced extreme latency. The team undertook a systematic investigation, employing both clearbox and blackbox debugging techniques to identify the root cause of the latency spikes, which was traced back to cAdvisor's intrusive memory metrics collection. The resolution involved disabling the problematic metric, thereby stabilizing the search service's performance. This case study illustrates the complexities of migrating large-scale systems and the importance of resource isolation and debugging strategies in distributed environments.
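To make the "one in a million" framing concrete, here is a minimal sketch of how such a failure rate might be measured from raw latency samples, and why common percentile dashboards can miss it. The function names, the 1,000 ms threshold, and the sample values are illustrative assumptions, not figures from the article.

```python
import math


def slow_requests_per_million(latencies_ms, threshold_ms):
    """Return how many requests per million took longer than threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    slow = sum(1 for value in latencies_ms if value > threshold_ms)
    return slow * 1_000_000 / len(latencies_ms)


def percentile(latencies_ms, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical workload: one million requests, exactly one of them very slow.
    samples = [5.0] * 999_999 + [5000.0]
    print("slow per million:", slow_requests_per_million(samples, threshold_ms=1000.0))
    print("p99.99 (ms):", percentile(samples, 99.99))
    print("max (ms):", max(samples))
```

With data shaped like this, even the 99.99th percentile sits at the normal latency, so an outlier this rare shows up only in the maximum or in an explicit count of slow requests.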
Key Learnings
- The migration of complex systems like Manas to Kubernetes requires careful consideration of resource management and monitoring tools.
- Identifying performance bottlenecks in distributed systems often necessitates a combination of profiling and isolating variables to narrow down potential causes.
- Intrusive monitoring metrics can significantly impact performance, necessitating a balance between observability and system efficiency.
- Effective debugging in distributed systems can involve both clearbox and blackbox approaches, and sometimes requires unconventional methods like process suspension to identify issues.
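The process-suspension idea in the last point can be sketched as follows. This is a rough illustration that assumes suspension is done with the POSIX `SIGSTOP` and `SIGCONT` signals; the summary does not say which mechanism the team used. The `measure` callable and the `send` parameter are hypothetical hooks introduced here so the sketch can be exercised without touching a real process.

```python
import os
import signal
from contextlib import contextmanager


@contextmanager
def suspended(pid, send=os.kill):
    """Pause process `pid` for the duration of the with-block, then resume it.

    `send` defaults to os.kill; it is injectable so the logic can be tested
    with a fake (an assumption of this sketch, POSIX only).
    """
    send(pid, signal.SIGSTOP)  # the kernel stops scheduling the process
    try:
        yield
    finally:
        send(pid, signal.SIGCONT)  # always resume, even if measuring fails


def compare_with_suspension(pid, measure, send=os.kill):
    """Run `measure` once normally and once while `pid` is suspended.

    `measure` is a hypothetical callable that returns some latency signal,
    for example the number of slow requests seen during a fixed window.
    """
    baseline = measure()
    with suspended(pid, send):
        while_paused = measure()
    return baseline, while_paused
```

If the slow requests disappear while a suspect process is paused and return once it resumes, that process becomes a strong candidate for closer inspection. Suspending a node-level agent affects whatever depends on it, so an experiment like this belongs on a test host.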
Who Should Read This
Senior Site Reliability Engineers (SREs) managing Kubernetes clusters for high-traffic applications facing performance challenges.
Test Your Knowledge
What are the trade-offs involved in using cAdvisor for monitoring in a memory-intensive application like Manas?
How does the architecture of Manas contribute to its performance challenges during the migration to Kubernetes?
What specific Linux kernel features were identified as contributing to the latency issues, and how did they interact with Kubernetes?
In what ways can resource isolation strategies like CPU shielding fail in a Kubernetes environment, and what alternatives could be considered?
Why is it important to validate that issues are not caused by the underlying AMI when troubleshooting performance in Kubernetes?