Netflix

•

10 min read

•September 22, 2025

Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale

Summary

The article discusses Netflix's Muse application, which aims to deliver data-driven insights for content discovery. It highlights the evolution of Muse's architecture from a simple dashboard to a complex system capable of handling trillions of rows of data. Key advancements include the integration of HyperLogLog sketches for efficient distinct counting, the use of the Hollow library for in-memory data access, and various optimizations to the Druid cluster to improve query performance. The article also emphasizes the importance of rigorous validation processes during the rollout of these architectural changes to ensure data integrity and user trust.

Key Learnings

1Utilizing HyperLogLog sketches significantly reduces the resource intensity of distinct counting in distributed systems, achieving a balance between performance and accuracy.
2The Hollow library enables efficient in-memory storage and access, allowing for rapid data retrieval and reducing the load on the Druid cluster.
3Optimizing Druid configurations, such as segment sizes and broker counts, is crucial for maintaining high throughput and low latency in query performance.
4Implementing a parallel stack deployment strategy allows for effective validation of new metrics systems while minimizing risk during transitions.
5Adopting a combination of automated validation tools and in-app comparison features enhances the ability to monitor and ensure data quality throughout the rollout process.

Who Should Read This

Senior Data Engineers optimizing OLAP systems and enhancing data processing pipelines for large-scale analytics.

Test Your Knowledge

What trade-offs are involved in using HyperLogLog sketches for distinct counts compared to exact counting methods?

How does the use of the Hollow library impact the overall architecture and performance of the Muse application?

What specific optimizations were made to the Druid cluster, and how do they affect query performance under high concurrency?

In what scenarios might the reliance on precomputed aggregates lead to inaccuracies in the data served by Muse?

What validation strategies were employed to ensure data integrity during the rollout of the new metrics system?

Topics

Apache Spark Apache Druid Etl Pipelines Data Quality Data Warehousing

Read Full Article at Netflix

More from Netflix Engineering

View Netflix engineering blogs →

Netflix

10m

ML Observability: Bringing Transparency to Payments and Beyond

The article explores the critical role of ML observability in enhancing the performance and reliability of machine learning models, particularly in payment processing at Netflix. It emphasizes the...

Netflix

From Facts & Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix

The article outlines the transformation of data engineering at Netflix, emphasizing the shift from traditional data practices to a new specialization known as Media ML Data Engineering. This...

Netflix

Empowering Netflix Engineers with Incident Management

The article outlines Netflix's journey to democratize incident management, shifting from a centralized model to empowering engineering teams across the organization. It emphasizes the importance of a...

Netflix

15m

Building a Resilient Data Platform with Write-Ahead Log at Netflix

The article details Netflix's approach to building a resilient data platform using a Write-Ahead Log (WAL) system to address challenges such as data loss, corruption, and system entropy across...

Netflix

24m

100X Faster: How We Supercharged Netflix Maestro’s Workflow Engine

The article discusses a significant upgrade to the Maestro workflow engine at Netflix, achieving a performance improvement of 100X by reducing execution overhead from seconds to milliseconds. It...

Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale

Summary

Key Learnings

Who Should Read This

Test Your Knowledge

Topics

More articles about Apache Spark

Activate first-party data with Meta Conversions API on Databricks

Real-Time Mode: Ultra-low latency streaming on Spark APIs without a second engine

Spark Declarative Pipelines: Why Data Engineering Needs to Become End-to-End Declarative

Drastically Reducing Out-of-Memory Errors in Apache Spark at Pinterest

Why Apache Spark Real-Time Mode Is A Game Changer for Ad Attribution

More from Netflix Engineering

ML Observability: Bringing Transparency to Payments and Beyond

From Facts & Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix

Empowering Netflix Engineers with Incident Management

Building a Resilient Data Platform with Write-Ahead Log at Netflix

100X Faster: How We Supercharged Netflix Maestro’s Workflow Engine

Related topics