Drastically Reducing Out-of-Memory Errors in Apache Spark at Pinterest

Summary

This article details how Pinterest significantly reduced out-of-memory (OOM) errors in its Apache Spark applications with a feature called Auto Memory Retries. The feature automatically identifies tasks with higher memory demands and retries them on larger executors, which improves resource usage and reduces job failures. The article describes the challenges of running small executors and the historical frequency of OOM errors, which led to a hybrid strategy: first increase a task's CPU properties, and launch larger executors only if that is not enough. The implementation extends core Apache Spark classes to manage task resource profiles, and it resulted in a 96% reduction in OOM failures and a more efficient Spark deployment.
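The first step of the hybrid strategy can be illustrated with some simple arithmetic. The sketch below reads "increasing CPU properties" as raising the number of CPUs reserved per task (Spark's `spark.task.cpus` setting), which is an assumption about the summary's wording. It also assumes, as a simplification, that an executor's memory is split evenly among the tasks running on it at the same time. The function name and the example sizes are hypothetical and are not taken from Pinterest's deployment.

```python
def memory_per_task_mib(executor_memory_mib: int,
                        executor_cores: int,
                        task_cpus: int) -> float:
    """Approximate memory available to one task.

    Assumption (simplification): executor memory is shared evenly
    among the tasks that run concurrently on the executor.
    """
    if task_cpus < 1 or task_cpus > executor_cores:
        raise ValueError("task_cpus must be between 1 and executor_cores")
    # An executor runs as many tasks at once as its cores allow.
    concurrent_tasks = executor_cores // task_cpus
    return executor_memory_mib / concurrent_tasks


# Hypothetical executor: 8 GiB of memory, 4 cores.
for cpus in (1, 2, 4):
    share = memory_per_task_mib(8192, 4, cpus)
    print(f"task_cpus={cpus}: about {share:.0f} MiB per task")
```

Under these assumptions, doubling the CPUs per task halves the number of concurrent tasks and so doubles each task's memory share on the same executor, without launching anything new. Once a single task occupies the whole executor, the only remaining option is a larger executor.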

Key Learnings

1. Implementing Auto Memory Retries can drastically reduce OOM errors by dynamically adjusting executor sizes based on task requirements.
2. The hybrid strategy of increasing CPU properties before launching larger executors manages resource allocation effectively without significant overhead.
3. Creating immutable retry resource profiles allows resources to be scaled systematically based on historical task performance, which improves job reliability.
4. Monitoring and gradual rollout of new features are crucial to ensuring stability and performance improvements in large-scale systems.
5. Understanding the memory requirements of tasks and adjusting configurations proactively can lead to significant cost savings and improved system performance.
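The idea of immutable retry resource profiles combined with the CPU-first escalation order can be sketched as follows. This is a minimal illustration, not Pinterest's implementation and not Spark's own resource profile API: the `RetryProfile` class, the helper functions, and the choice to double resources at each step are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)  # frozen: a profile cannot be mutated once created
class RetryProfile:
    task_cpus: int
    executor_memory_mib: int
    executor_cores: int


def build_retry_ladder(base: RetryProfile,
                       max_memory_mib: int) -> Tuple[RetryProfile, ...]:
    """Precompute the escalation steps (doubling is an assumed policy)."""
    ladder = [base]
    current = base
    # Step 1: give the task more CPUs on the same executor size.
    while current.task_cpus < current.executor_cores:
        cpus = min(current.task_cpus * 2, current.executor_cores)
        current = replace(current, task_cpus=cpus)
        ladder.append(current)
    # Step 2: only then move to larger executors, up to a cap.
    while current.executor_memory_mib * 2 <= max_memory_mib:
        current = replace(current,
                          executor_memory_mib=current.executor_memory_mib * 2)
        ladder.append(current)
    return tuple(ladder)


def next_profile(ladder: Tuple[RetryProfile, ...],
                 current: RetryProfile) -> Optional[RetryProfile]:
    """Return the profile to retry with after an OOM, or None if exhausted."""
    idx = ladder.index(current)
    return ladder[idx + 1] if idx + 1 < len(ladder) else None


base = RetryProfile(task_cpus=1, executor_memory_mib=8192, executor_cores=4)
ladder = build_retry_ladder(base, max_memory_mib=32768)
for step in ladder:
    print(step)
```

Because each profile is frozen, a retry never alters the profile that other running tasks still rely on; it simply moves to the next precomputed step.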

Who Should Read This

Senior Data Engineers focused on optimizing Apache Spark applications and reducing resource consumption in large-scale data processing environments.

Test Your Knowledge

What are the trade-offs between increasing CPU properties and launching larger executors in the context of managing OOM errors?

How does the implementation of Auto Memory Retries affect the overall resource allocation strategy in Apache Spark?

What challenges did Pinterest face in identifying and tuning executor sizes for their Spark jobs, and how were these addressed?

Why is it important to create immutable retry resource profiles, and how do they contribute to task management in Spark?

In what ways did the monitoring of metrics influence the rollout strategy for the Auto Memory Retries feature?

Topics

Apache Spark, Data Quality, ETL Pipelines, Data Governance
