Pinterest
18 min read

Next Gen Data Processing at Massive Scale At Pinterest With Moka (Part 1 of 2)

Summary

The article outlines Pinterest's transition from a Hadoop-based data processing platform to a Kubernetes-based architecture, specifically leveraging Spark on AWS Elastic Kubernetes Service (EKS). It details the rationale behind this shift, including the need for enhanced performance, cost-effectiveness, and improved developer velocity. The authors discuss the challenges faced during integration, the deployment model, and the supporting frameworks that facilitate this transition. Key components such as the Spark Operator and a new job submission service called Archer are introduced, highlighting their roles in managing Spark applications within the Kubernetes ecosystem.

Key Learnings

  • Kubernetes offers superior container management and deployment capabilities compared to Hadoop, making it an attractive alternative for data processing.
  • The integration of EKS into Pinterest's existing environment requires careful planning to ensure compatibility and adherence to security practices.
  • Utilizing the Spark Operator allows for declarative management of Spark applications, but it introduces challenges such as premature pod cleanup that must be managed.
  • Performance tuning in a Kubernetes environment can leverage newer EC2 instance types and autoscaling features to optimize resource usage.
  • The development of Archer as a job submission service addresses the limitations of the previous Hadoop-based job submission system, enhancing job tracking and management.
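
To make "declarative management" concrete, here is a minimal sketch of what a Spark application description for the open-source Kubernetes Spark Operator might look like, built as a plain Python dictionary. The field names follow the operator's public `v1beta2` `SparkApplication` resource as best understood here; the helper name `build_spark_application`, the namespace, image, service account, and resource sizes are all hypothetical placeholders, not Pinterest's actual configuration.

```python
import json


def build_spark_application(name, image, main_file, executors=2):
    """Return a SparkApplication manifest as a plain dict.

    Field names follow the open-source Spark Operator's v1beta2 API;
    every value below is an illustrative placeholder (assumption).
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "spark-jobs"},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": "org.apache.spark.examples.SparkPi",
            "mainApplicationFile": main_file,
            "sparkVersion": "3.5.0",
            "restartPolicy": {"type": "Never"},
            # Driver and executor sizing is declared, not scripted:
            # the operator reconciles pods to match this description.
            "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
            "executor": {"cores": 2, "instances": executors, "memory": "4g"},
        },
    }


manifest = build_spark_application(
    name="example-pi",
    image="example.registry/spark:3.5.0",  # hypothetical image
    main_file="local:///opt/spark/examples/jars/spark-examples.jar",
)
print(json.dumps(manifest, indent=2))
```

The point of the declarative form is that the submitter states the desired application once and the operator is responsible for creating the driver and executor pods to match it. A submission service could generate such a document and hand it to the cluster rather than invoking `spark-submit` directly, though how Archer does this is not described in this summary.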

Who Should Read This

Senior Data Engineers designing scalable data processing solutions using Kubernetes and Spark

Test Your Knowledge

What are the key advantages of using Kubernetes over Hadoop for data processing at Pinterest?

How does the Spark Operator enhance the deployment and management of Spark applications in a Kubernetes environment?

What specific challenges did Pinterest face when integrating EKS into their existing infrastructure?

In what ways does the new job submission service Archer improve upon the previous system used with Hadoop?

What considerations must be taken into account when optimizing performance for Spark on EKS?
