Drastically Reducing Out-of-Memory Errors in Apache Spark at Pinterest
Read Full ArticleSummary
This article details Pinterest's approach to significantly reduce out-of-memory (OOM) errors in their Apache Spark applications through a feature called Auto Memory Retries. By automatically identifying tasks with higher memory demands and retrying them on larger executors, Pinterest aims to optimize resource usage and minimize job failures. The article outlines the challenges faced with small executor sizes and the historical frequency of OOM errors, leading to the development of a hybrid strategy that combines increasing CPU properties and launching larger executors. The implementation involves extending core Apache Spark classes to manage task resource profiles effectively, ultimately resulting in a 96% reduction in OOM failures and a more efficient Spark deployment.
Key Learnings
- 1Implementing Auto Memory Retries can drastically reduce OOM errors by dynamically adjusting executor sizes based on task requirements.
- 2The hybrid strategy of increasing CPU properties first before launching larger executors is effective in managing resource allocation without significant overhead.
- 3Creating immutable retry resource profiles allows for systematic scaling of resources based on historical task performance, enhancing job reliability.
- 4Monitoring and gradual rollout of new features are crucial to ensure stability and performance improvements in large-scale systems.
- 5Understanding the memory requirements of tasks and adjusting configurations proactively can lead to significant cost savings and improved system performance.
Who Should Read This
Senior Data Engineers focused on optimizing Apache Spark applications and reducing resource consumption in large-scale data processing environments.
Test Your Knowledge
What are the trade-offs between increasing CPU properties and launching larger executors in the context of managing OOM errors?
How does the implementation of Auto Memory Retries affect the overall resource allocation strategy in Apache Spark?
What challenges did Pinterest face in identifying and tuning executor sizes for their Spark jobs, and how were these addressed?
Why is it important to create immutable retry resource profiles, and how do they contribute to task management in Spark?
In what ways did the monitoring of metrics influence the rollout strategy for the Auto Memory Retries feature?
Topics
More articles about Apache Spark
Explore Apache Spark engineering →Activate first-party data with Meta Conversions API on Databricks
The article introduces the Meta Conversions API as a solution accelerator available on the Databricks Marketplace, aimed at enhancing the activation of first-party data for marketing teams. It...
Real-Time Mode: Ultra-low latency streaming on Spark APIs without a second engine
The article introduces Real-Time Mode (RTM) in Apache Spark, which unifies offline training and ultra-low-latency online feature engineering into a single engine, eliminating the need for separate...
Spark Declarative Pipelines: Why Data Engineering Needs to Become End-to-End Declarative
The article highlights the challenges faced by data engineering teams as they grapple with increasing data volumes and complexities. It emphasizes the limitations of traditional data engineering...
Why Apache Spark Real-Time Mode Is A Game Changer for Ad Attribution
The article discusses the introduction of Apache Spark's Real-Time Mode, which enables millisecond-latency operational streaming workloads for ad attribution. It highlights the use of the...
Next Generation DB Ingestion at Pinterest
The article outlines Pinterest's transition from a legacy batch-oriented database ingestion system to a modern, real-time ingestion framework utilizing Change Data Capture (CDC) technologies. The new...
More from Pinterest Engineering
View Pinterest engineering blogs →Unified Context-Intent Embeddings for Scalable Text-to-SQL
The article outlines Pinterest's evolution from basic Text-to-SQL systems to a sophisticated Analytics Agent that leverages unified context-intent embeddings for enhanced query understanding and SQL...
Unifying Ads Engagement Modeling Across Pinterest Surfaces
The article presents a comprehensive approach to unify ads engagement modeling across different surfaces at Pinterest, addressing the challenges posed by previously independent models. It outlines...
Bridging the Gap: Diagnosing Online–Offline Discrepancy in Pinterest’s L1 Conversion Models
The article discusses the challenges faced by Pinterest in reconciling offline and online performance metrics of their L1 conversion models. It highlights the discrepancies observed between strong...
Piqama: Pinterest Quota Management Ecosystem
The article introduces Piqama, Pinterest's comprehensive quota management ecosystem designed to oversee resource quotas across various systems. It outlines the architecture of Piqama, emphasizing its...
GPU-Serving Two-Tower Models for Lightweight Ads Engagement Prediction
The article presents a significant advancement in Pinterest's ads recommendation system through the introduction of a GPU-serving two-tower model for lightweight ranking. This model architecture...