Next Gen Data Processing at Massive Scale At Pinterest With Moka (Part 2 of 2)
Summary
The article discusses Pinterest's transition from a Hadoop-based data processing platform to Moka, a next-generation system designed for massive-scale data processing. It highlights the deployment of Moka on AWS Elastic Kubernetes Service (EKS), detailing the use of Terraform for infrastructure management and the implementation of a comprehensive logging and observability framework. The article also covers the challenges and solutions in managing container images and ensuring effective metrics collection and analysis for operational efficiency.
Key Learnings
1. Moka's deployment on AWS EKS is structured into multiple environments (test, dev, staging, production) to ensure isolation and security.
2. The logging infrastructure leverages Fluent Bit for efficient log management, enabling the aggregation of Spark application logs and system pod logs in Amazon S3.
3. Observability is enhanced through a combination of Prometheus and OpenTelemetry, allowing for detailed insights into the performance of EKS clusters.
4. The article emphasizes the importance of containerization in Moka, ensuring full isolation and compatibility across different architectures (Intel and ARM).
5. The use of Terraform modules facilitates a modular and reusable approach to infrastructure as code, streamlining the deployment process.
Who Should Read This
Senior Data Engineers implementing scalable data processing solutions on cloud platforms like AWS
Test Your Knowledge
What are the trade-offs of using AWS EKS for deploying Moka compared to traditional Hadoop clusters?
How does Fluent Bit enhance the logging capabilities of Spark applications running on Moka?
What design decisions were made to ensure observability in the Moka platform, and what challenges did they address?
In what ways does the containerization strategy in Moka differ from the previous Monarch platform, and why is this significant?
How does the architecture of Moka support scalability and reliability in data processing workloads?