Scaling Pinterest ML Infrastructure with Ray: From Training to End-to-End ML Pipelines

Summary

The article outlines how Pinterest has expanded the capabilities of Ray beyond traditional training and inference tasks to create a comprehensive machine learning infrastructure. It details the challenges faced with existing Spark-based workflows, such as slow data pipelines and inefficient feature iterations, and describes how Ray's integration has led to significant improvements in feature development, sampling, and labeling processes. Key innovations include the development of a Ray Data native pipeline API, efficient data joining with Iceberg bucket joins, and optimizations for large-scale workloads, resulting in a tenfold reduction in ML iteration times and lower infrastructure costs.
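To make the bucket-join idea concrete, here is a minimal pure-Python sketch, not Pinterest's implementation and not the Iceberg or Ray API. It assumes both tables were written with the same bucket count on the same key, so bucket i on one side only ever needs to meet bucket i on the other. The names `bucket_of`, `write_bucketed`, `bucket_join`, and the `pin_id` key are hypothetical, and `zlib.crc32` stands in for Iceberg's own bucket hash.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, must match on both tables


def bucket_of(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    # A stable hash so both tables agree on bucket assignment.
    # crc32 is only a stand-in for the table format's real bucket hash.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_bucketed(rows, key, num_buckets=NUM_BUCKETS):
    # Simulates writing a table partitioned by bucket(key).
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    return buckets


def bucket_join(left, right, key, num_buckets=NUM_BUCKETS):
    # Inner join, one bucket pair at a time. Only one bucket of the
    # right side is indexed in memory at any moment, and no global
    # shuffle is needed because matching keys share a bucket number.
    joined = []
    for b in range(num_buckets):
        index = defaultdict(list)
        for r in right.get(b, []):
            index[r[key]].append(r)
        for l in left.get(b, []):
            for r in index.get(l[key], []):
                joined.append({**l, **r})
    return joined


examples = [
    {"pin_id": "a", "label": 1},
    {"pin_id": "b", "label": 0},
    {"pin_id": "c", "label": 1},
]
features = [
    {"pin_id": "a", "ctr": 0.25},
    {"pin_id": "c", "ctr": 0.75},
]
joined_rows = bucket_join(
    write_bucketed(examples, "pin_id"),
    write_bucketed(features, "pin_id"),
    "pin_id",
)
```

Because each bucket pair is independent, the per-bucket loop is the natural unit to hand to parallel workers; the sketch runs them sequentially only to stay self-contained.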

Key Learnings

1. Ray can be effectively utilized to streamline the entire ML infrastructure, not just training and inference.
2. Implementing a Ray Data native pipeline API allows for on-the-fly feature transformation, significantly reducing preprocessing time.
3. Iceberg bucket joins enable efficient feature joining across datasets without the need for extensive precomputation.
4. Optimizing Ray's data processing capabilities can lead to substantial performance improvements in ML workflows.
5. The integration of caching mechanisms can enhance iteration speeds and reduce redundant computations in ML experiments.
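The second and fifth points can be illustrated together with a small pure-Python sketch. It is an assumption-laden toy, not Ray Data's API or Pinterest's pipeline: a feature transform is applied to batches as they stream toward the trainer instead of being materialized in a separate preprocessing job, and results are cached under a key derived from the transform version and the batch contents. `CachedTransform`, `normalize_ctr`, `stream_batches`, and `training_batches` are hypothetical names. In Ray Data the per-batch hook would more likely be `Dataset.map_batches`.

```python
import hashlib
import json


def normalize_ctr(row):
    # Hypothetical feature transform: express CTR as a percentage.
    return {**row, "ctr_pct": round(row["ctr"] * 100, 2)}


class CachedTransform:
    """Applies a row transform per batch and memoizes the result."""

    def __init__(self, fn, version):
        self.fn = fn
        self.version = version  # bump to invalidate old cache entries
        self.cache = {}
        self.misses = 0

    def _key(self, batch):
        # The key covers both the transform version and the input data,
        # so changing either one forces a recomputation.
        payload = json.dumps(batch, sort_keys=True).encode("utf-8")
        return hashlib.sha256(self.version.encode("utf-8") + payload).hexdigest()

    def __call__(self, batch):
        key = self._key(batch)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = [self.fn(row) for row in batch]
        return self.cache[key]


def stream_batches(rows, batch_size):
    # Yields fixed-size batches lazily, as a data loader would.
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def training_batches(rows, transform, batch_size=2):
    # Transformation happens on the fly, batch by batch.
    for batch in stream_batches(rows, batch_size):
        yield transform(batch)


rows = [
    {"pin_id": "a", "ctr": 0.25},
    {"pin_id": "b", "ctr": 0.5},
    {"pin_id": "c", "ctr": 0.75},
]
transform = CachedTransform(normalize_ctr, version="v1")
first_epoch = list(training_batches(rows, transform))
second_epoch = list(training_batches(rows, transform))
```

In this toy the second pass over the data is served entirely from the cache, which is the effect the article attributes to caching across repeated experiments. A real system would also need eviction and a shared store rather than an in-process dictionary.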

Who Should Read This

Senior machine learning engineers who are building scalable ML infrastructure on Ray and want to optimize their data processing workflows.

Test Your Knowledge

What are the specific trade-offs involved in using Iceberg bucket joins versus traditional precomputation methods?

How does the Ray Data native pipeline API facilitate faster feature development and what are its limitations?

In what scenarios might the optimizations for large workloads in Ray not yield the expected performance improvements?

What design decisions were made to balance memory usage and computation speed in the implementation of bucket joins?

How does the caching mechanism in Ray affect the overall workflow of ML experiments and what challenges might arise from it?

Topics

Ray, Machine Learning, Data Processing, Feature Engineering, Optimization
