Next Gen Data Processing at Massive Scale At Pinterest With Moka (Part 1 of 2)
Summary
The article outlines Pinterest's transition from a Hadoop-based data processing platform to a Kubernetes-based architecture, specifically leveraging Spark on AWS Elastic Kubernetes Service (EKS). It details the rationale behind this shift, including the need for enhanced performance, cost-effectiveness, and improved developer velocity. The authors discuss the challenges faced during integration, the deployment model, and the supporting frameworks that facilitate this transition. Key components such as the Spark Operator and a new job submission service called Archer are introduced, highlighting their roles in managing Spark applications within the Kubernetes ecosystem.
Key Learnings
1. Kubernetes offers superior container management and deployment capabilities compared to Hadoop, making it an attractive alternative for data processing.
2. The integration of EKS into Pinterest's existing environment requires careful planning to ensure compatibility and adherence to security practices.
3. Utilizing the Spark Operator allows for declarative management of Spark applications, but it introduces challenges such as premature pod cleanup that must be managed.
4. Performance tuning in a Kubernetes environment can leverage newer EC2 instance types and autoscaling features to optimize resource usage.
5. The development of Archer as a job submission service addresses the limitations of the previous Hadoop-based job submission system, enhancing job tracking and management.
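To make the "declarative management" point about the Spark Operator concrete, here is a minimal sketch of what a SparkApplication resource could look like when built in Python. The field names follow the open-source Spark Operator's `sparkoperator.k8s.io/v1beta2` resource definition as I understand it. Every value (name, namespace, image, file path, Spark version, resource sizes, service account) is a hypothetical placeholder, and the helper `build_spark_application` is invented for illustration. None of it is taken from Pinterest's actual configuration.

```python
import json


def build_spark_application(name, namespace, image, main_file, executors=2):
    """Return a SparkApplication manifest as a plain dict.

    All defaults below are illustrative assumptions, not Pinterest's settings.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",            # language of the Spark job
            "mode": "cluster",           # driver runs in a pod on the cluster
            "image": image,              # container image holding Spark and the job
            "mainApplicationFile": main_file,
            "sparkVersion": "3.5.0",     # assumed version
            "restartPolicy": {"type": "Never"},
            "driver": {
                "cores": 1,
                "memory": "2g",
                "serviceAccount": "spark",  # hypothetical service account
            },
            "executor": {
                "cores": 2,
                "instances": executors,
                "memory": "4g",
            },
        },
    }


if __name__ == "__main__":
    manifest = build_spark_application(
        name="example-job",                        # hypothetical
        namespace="spark-jobs",                    # hypothetical
        image="example.registry/spark:latest",     # hypothetical
        main_file="local:///opt/app/job.py",       # hypothetical
    )
    # kubectl accepts JSON as well as YAML, so this output could be saved
    # to a file and applied to a cluster where the operator is installed.
    print(json.dumps(manifest, indent=2))
```

The manifest describes the desired state of the job (image, resources, executor count), and the operator is responsible for creating the driver and executor pods to match it. A submission service such as Archer could sit in front of a step like this, although the summary does not describe how Archer actually builds or submits its resources.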
Who Should Read This
Senior Data Engineers designing scalable data processing solutions using Kubernetes and Spark
Test Your Knowledge
What are the key advantages of using Kubernetes over Hadoop for data processing at Pinterest?
How does the Spark Operator enhance the deployment and management of Spark applications in a Kubernetes environment?
What specific challenges did Pinterest face when integrating EKS into their existing infrastructure?
In what ways does the new job submission service Archer improve upon the previous system used with Hadoop?
What considerations must be taken into account when optimizing performance for Spark on EKS?