Netflix • 24 min read • September 29, 2025

100X Faster: How We Supercharged Netflix Maestro’s Workflow Engine

Summary

The article describes a major upgrade to Netflix's Maestro workflow engine that achieves a 100X performance improvement by cutting execution overhead from seconds to milliseconds. It traces Maestro's architectural evolution from a multi-layered system with high overhead to a simplified design that keeps state in memory and optimizes task execution. The new design improves scalability and fault tolerance while removing earlier bottlenecks caused by race conditions and job execution delays. A new internal flow engine and optimized queue management are the key components of this transformation.

Key Learnings

1. The transition to an in-memory state management system significantly reduces latency and improves throughput for workflow executions.
2. By simplifying the architecture and removing unnecessary dependencies, the new Maestro engine enhances operational reliability and performance.
3. The introduction of flow groups allows for efficient partitioning and ownership management of workflows, maintaining horizontal scalability while improving speed.
4. Replacing external distributed job queues with internal ones streamlines processing and ensures strong delivery guarantees, addressing previous edge cases.
5. The decision to rewrite the internal flow engine was driven by the need to eliminate complexity and improve performance tailored to Maestro's specific use cases.
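
The flow-group idea in point 3 can be illustrated with a small sketch. This summary does not say how Maestro assigns flows to groups or how ownership is claimed, so everything below is an assumption made for illustration: the hash-based `group_for` function, the `Node` class, and the fixed group count are hypothetical names, not Maestro's API. The sketch only shows the general pattern of partitioning flows into groups, giving each group a single owner, and letting that owner keep flow state in memory.

```python
import hashlib
from dataclasses import dataclass, field


def group_for(flow_id: str, num_groups: int) -> int:
    """Map a flow to a group with a stable hash (assumed scheme, not Maestro's)."""
    digest = hashlib.sha256(flow_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_groups


@dataclass
class Node:
    """Hypothetical engine node that owns some flow groups."""
    name: str
    owned_groups: set = field(default_factory=set)
    # In-memory state for owned flows: flow_id -> status string.
    state: dict = field(default_factory=dict)

    def owns(self, flow_id: str, num_groups: int) -> bool:
        return group_for(flow_id, num_groups) in self.owned_groups

    def update(self, flow_id: str, status: str, num_groups: int) -> None:
        # Only the owner of a flow's group may mutate its state, so two
        # nodes never race on the same flow.
        if not self.owns(flow_id, num_groups):
            raise PermissionError(f"{self.name} does not own {flow_id}")
        self.state[flow_id] = status


if __name__ == "__main__":
    NUM_GROUPS = 4
    node_a = Node("node-a", {0, 1})
    node_b = Node("node-b", {2, 3})
    for flow_id in ("flow-1", "flow-2", "flow-3"):
        owner = node_a if node_a.owns(flow_id, NUM_GROUPS) else node_b
        owner.update(flow_id, "RUNNING", NUM_GROUPS)
        print(flow_id, "->", owner.name)
```

Because every flow maps to exactly one group and every group has exactly one owner, adding nodes spreads groups across more machines without introducing shared mutable state, which is the scalability property the key learning describes.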

Who Should Read This

Senior Software Architects designing high-performance workflow orchestration systems in large-scale environments

Test Your Knowledge

What were the primary bottlenecks in the original Maestro architecture that necessitated a redesign?

How does the new in-memory state management improve the performance of the Maestro engine compared to the previous design?

What trade-offs were considered when deciding between upgrading the existing Conductor library and implementing a new internal flow engine?

In what ways does the flow group concept enhance both scalability and performance in the Maestro workflow engine?

What guarantees does the new internal queue system provide to ensure reliable job processing and execution?

Topics

Microservices, Event-driven Architecture, Layered Architecture, Service Mesh
