Next Gen Data Processing at Massive Scale At Pinterest With Moka (Part 1 of 2)

Summary

The article outlines Pinterest's transition from a Hadoop-based data processing platform to a Kubernetes-based architecture that runs Spark on Amazon Elastic Kubernetes Service (EKS). It explains the rationale for the shift: better performance, lower cost, and improved developer velocity. The authors discuss the challenges of integrating EKS into the existing environment, the deployment model, and the supporting frameworks behind the migration. They also introduce two key components, the Spark Operator and a new job submission service called Archer, and describe their roles in managing Spark applications on Kubernetes.

Key Learnings

1. Kubernetes offers superior container management and deployment capabilities compared to Hadoop, making it an attractive alternative for data processing.
2. The integration of EKS into Pinterest's existing environment requires careful planning to ensure compatibility and adherence to security practices.
3. Utilizing the Spark Operator allows for declarative management of Spark applications, but it introduces challenges such as premature pod cleanup that must be managed.
4. Performance tuning in a Kubernetes environment can leverage newer EC2 instance types and autoscaling features to optimize resource usage.
5. The development of Archer as a job submission service addresses the limitations of the previous Hadoop-based job submission system, enhancing job tracking and management.
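
To make "declarative management" concrete, here is a minimal sketch of the kind of SparkApplication resource the open-source Spark Operator consumes, built as a plain Python dictionary. The field names follow the operator's v1beta2 custom resource as commonly documented. Every value (namespace, image, class name, file path, Spark version, resource sizes) and the helper name build_spark_application are hypothetical placeholders, not Pinterest's actual configuration.

```python
import json


def build_spark_application(name, image, main_class, app_file, executors=2):
    """Return a SparkApplication manifest as a plain dict.

    All defaults below are illustrative assumptions, not values from the article.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "spark-jobs"},  # hypothetical namespace
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": app_file,
            "sparkVersion": "3.5.0",  # assumed version
            "restartPolicy": {"type": "Never"},
            "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
            "executor": {"cores": 2, "instances": executors, "memory": "4g"},
        },
    }


# Hypothetical job: the desired state is described once, and the operator
# is responsible for creating the driver and executor pods to match it.
manifest = build_spark_application(
    name="daily-aggregation",
    image="example.registry/spark:latest",
    main_class="com.example.DailyAggregation",
    app_file="local:///opt/spark/jars/daily-aggregation.jar",
    executors=4,
)

print(json.dumps(manifest, indent=2))
```

The caller states what should run rather than scripting how to launch it. A submission service in the role the article gives Archer would presumably generate and submit resources of this general shape, though the summary does not describe its interface.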

Who Should Read This

Senior Data Engineers designing scalable data processing solutions using Kubernetes and Spark

Test Your Knowledge

What are the key advantages of using Kubernetes over Hadoop for data processing at Pinterest?

How does the Spark Operator enhance the deployment and management of Spark applications in a Kubernetes environment?

What specific challenges did Pinterest face when integrating EKS into their existing infrastructure?

In what ways does the new job submission service Archer improve upon the previous system used with Hadoop?

What considerations must be taken into account when optimizing performance for Spark on EKS?

Topics

Apache Spark, Kubernetes, Data Processing, AWS, Big Data
