Scaling Pinterest ML Infrastructure with Ray: From Training to End-to-End ML Pipelines
Summary
The article describes how Pinterest extended Ray beyond training and inference to cover its broader machine learning infrastructure. It details the problems with the existing Spark-based workflows, such as slow data pipelines and inefficient feature iteration, and describes how integrating Ray improved feature development, sampling, and labeling. Key innovations include a Ray Data native pipeline API, efficient data joining with Iceberg bucket joins, and optimizations for large-scale workloads, resulting in a tenfold reduction in ML iteration times and lower infrastructure costs.
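As a rough illustration of the bucket-join idea named in the summary, the sketch below joins two tables that were bucketed on the same key with the same bucket count. Matching rows then always share a bucket number, so each bucket pair can be joined independently. This is a pure-Python sketch under stated assumptions, not Pinterest's or Iceberg's implementation: the names `bucket_of`, `write_bucketed`, and `bucket_join`, the `pin_id` column, and the sample rows are all hypothetical, and CRC32 stands in for the Murmur3 hash that Iceberg's bucket transform uses.

```python
import zlib
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def bucket_of(key: str, num_buckets: int) -> int:
    # Deterministic hash so both tables agree on bucket assignment.
    # Assumption: CRC32 is a stand-in; Iceberg's bucket transform uses Murmur3.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_bucketed(rows: Iterable[Row], key: str, num_buckets: int) -> Dict[int, List[Row]]:
    # Group rows by bucket number, mimicking a table written with bucket(N, key).
    buckets: Dict[int, List[Row]] = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    return buckets


def bucket_join(left: Dict[int, List[Row]], right: Dict[int, List[Row]], key: str) -> List[Row]:
    # Inner join, one bucket at a time. Only rows from the same bucket number
    # are compared, so no cross-bucket data movement is needed.
    joined: List[Row] = []
    for bucket_id, left_rows in left.items():
        index: Dict[Any, List[Row]] = defaultdict(list)
        for r in right.get(bucket_id, []):
            index[r[key]].append(r)
        for l in left_rows:
            for r in index.get(l[key], []):
                joined.append({**l, **r})
    return joined


if __name__ == "__main__":
    # Hypothetical sample data: labeled examples and a feature table.
    examples = [{"pin_id": "a", "label": 1}, {"pin_id": "b", "label": 0}]
    features = [{"pin_id": "a", "ctr": 0.2}, {"pin_id": "c", "ctr": 0.5}]
    left = write_bucketed(examples, "pin_id", num_buckets=4)
    right = write_bucketed(features, "pin_id", num_buckets=4)
    print(bucket_join(left, right, "pin_id"))
```

The join only works if both sides use the same hash function and the same number of buckets; if either differs, matching keys can land in different buckets and rows are silently dropped.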
Key Learnings
- Ray can be effectively utilized to streamline the entire ML infrastructure, not just training and inference.
- Implementing a Ray Data native pipeline API allows for on-the-fly feature transformation, significantly reducing preprocessing time.
- Iceberg bucket joins enable efficient feature joining across datasets without the need for extensive precomputation.
- Optimizing Ray's data processing capabilities can lead to substantial performance improvements in ML workflows.
- The integration of caching mechanisms can enhance iteration speeds and reduce redundant computations in ML experiments.
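The on-the-fly transformation and caching points in the list above can be sketched in plain Python. The example streams rows in batches, applies a feature transform lazily as each batch is consumed, and memoizes results so a repeated experiment run skips recomputation. It is a generic illustration under stated assumptions, not the Ray Data API or Pinterest's caching layer: `TransformCache`, `stream_batches`, `transformed`, `add_ctr`, and the sample columns are hypothetical names.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Iterable, Iterator, List

Batch = List[Dict[str, Any]]


class TransformCache:
    """In-memory memoization keyed by transform name plus batch contents."""

    def __init__(self) -> None:
        self._store: Dict[str, Batch] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, name: str, fn: Callable[[Batch], Batch], batch: Batch) -> Batch:
        # Content-addressed key: the same transform on the same input reuses the result.
        payload = json.dumps([name, batch], sort_keys=True).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = fn(batch)
        return self._store[key]


def stream_batches(rows: Iterable[Dict[str, Any]], batch_size: int) -> Iterator[Batch]:
    # Yield fixed-size batches without materializing the whole dataset.
    batch: Batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def transformed(batches: Iterable[Batch], name: str, fn: Callable[[Batch], Batch],
                cache: TransformCache) -> Iterator[Batch]:
    # Lazy: each batch is transformed only when the consumer asks for it.
    for batch in batches:
        yield cache.get_or_compute(name, fn, batch)


def add_ctr(batch: Batch) -> Batch:
    # Hypothetical feature: click-through rate derived from two raw columns.
    return [{**r, "ctr": r["clicks"] / r["impressions"]} for r in batch]


if __name__ == "__main__":
    rows = [{"pin_id": f"p{i}", "clicks": i, "impressions": 10} for i in range(5)]
    cache = TransformCache()
    for _ in range(2):  # second pass is served from the cache
        list(transformed(stream_batches(rows, 2), "add_ctr", add_ctr, cache))
    print(cache.hits, cache.misses)
```

Keying the cache on the transform name as well as the input means that renaming or versioning a transform invalidates its cached results, which is one simple way to avoid serving stale features after a definition changes.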
Who Should Read This
Senior Machine Learning Engineers implementing scalable ML infrastructures using Ray and seeking to optimize data processing workflows.
Test Your Knowledge
What are the specific trade-offs involved in using Iceberg bucket joins versus traditional precomputation methods?
How does the Ray Data native pipeline API facilitate faster feature development and what are its limitations?
In what scenarios might the optimizations for large workloads in Ray not yield the expected performance improvements?
What design decisions were made to balance memory usage and computation speed in the implementation of bucket joins?
How does the caching mechanism in Ray affect the overall workflow of ML experiments and what challenges might arise from it?