Beyond Two Towers: Re-architecting the Serving Stack for Next-Gen Ads Lightweight Ranking Models…

Summary

The article details the transition from a traditional Two-Tower architecture to a more complex GPU-based model inference system for next-generation ad ranking. It highlights the limitations of the Two-Tower model in capturing user-item interactions and the need for a more expressive architecture. The authors describe various optimizations implemented to maintain low latency while integrating heavy model inference, including feature fetching strategies, moving business logic into the model, and optimizing GPU inference speed. The re-architected system aims to enhance recommendation quality significantly while ensuring efficient resource utilization.
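To make the limitation concrete, here is a minimal sketch of Two-Tower scoring in plain Python. It assumes the standard formulation, in which user and item embeddings are computed independently and combined only by a dot product. The weights, feature values, and function names are hypothetical toy stand-ins, not taken from the article.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def linear(weights: Sequence[Sequence[float]], x: Sequence[float]) -> Vector:
    """Apply a weight matrix (given as a list of rows) to a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


# Hypothetical toy weights standing in for two trained towers.
USER_W = [[1.0, 0.0], [0.0, 1.0]]
ITEM_W = [[0.5, 0.5], [1.0, -1.0]]


def user_tower(user_features: Sequence[float]) -> Vector:
    """Embed the user; this tower never sees item features."""
    return linear(USER_W, user_features)


def item_tower(item_features: Sequence[float]) -> Vector:
    """Embed the item; this tower never sees user features."""
    return linear(ITEM_W, item_features)


def precompute_item_embeddings(
    items: Dict[str, Sequence[float]]
) -> Dict[str, Vector]:
    """Item embeddings do not depend on the user, so they can be built offline."""
    return {item_id: item_tower(feats) for item_id, feats in items.items()}


def rank(
    user_features: Sequence[float], item_embeddings: Dict[str, Vector]
) -> List[Tuple[str, float]]:
    """Score every item with a dot product against one user embedding."""
    u = user_tower(user_features)
    scores = [
        (item_id, sum(a * b for a, b in zip(u, emb)))
        for item_id, emb in item_embeddings.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

The efficiency comes from the fact that the user embedding is computed once per request and the item embeddings can be cached. The cost is that the user and the item meet only in the final dot product, so features describing a specific user-item pair have no place to enter the model.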

Key Learnings

1. The Two-Tower architecture, while efficient, lacks the expressiveness needed for complex user-item interactions, necessitating a shift to more sophisticated models.
2. Bundling features into the PyTorch model file eliminates network overhead, significantly reducing latency in feature fetching.
3. Moving business logic into the model itself allows for parallel processing on the GPU, improving efficiency and reducing unnecessary data transmission.
4. Optimizations such as multi-stream CUDA and kernel fusion can drastically reduce inference latency, enabling real-time performance.
5. Rethinking data flow and retrieval strategies can lead to substantial improvements in system performance and latency.
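The second and third learnings can be illustrated with a small sketch, written in plain Python so that it runs without a GPU. It is an assumption-laden stand-in rather than the article's implementation: the class name `BundledRanker`, the blocklist rule, and the toy scoring function are all hypothetical. In a real PyTorch model the feature table would more likely be a tensor stored with the module (for example, a registered buffer), and the filtering would be a vectorized mask.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple


class BundledRanker:
    """Toy ranker that carries its item features and one business rule."""

    def __init__(
        self,
        item_features: Dict[str, Sequence[float]],
        blocked_ids: Iterable[str],
    ) -> None:
        # The feature table ships with the model artifact, so a lookup is a
        # local dictionary access instead of a network call.
        self._item_features = dict(item_features)
        # Hypothetical business rule: these items must never be served.
        self._blocked: Set[str] = set(blocked_ids)

    def _score(self, user: Sequence[float], item: Sequence[float]) -> float:
        # Placeholder for the learned model: a simple dot product.
        return sum(u * i for u, i in zip(user, item))

    def top_k(
        self, user: Sequence[float], candidate_ids: Sequence[str], k: int
    ) -> List[Tuple[str, float]]:
        scored: List[Tuple[str, float]] = []
        for cid in candidate_ids:
            if cid in self._blocked:
                continue  # rule applied inside the model, before any output
            feats = self._item_features.get(cid)
            if feats is None:
                continue  # unknown candidate: no bundled features to score
            scored.append((cid, self._score(user, feats)))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        # Only the k survivors leave the model, so the serving layer does not
        # receive scores for candidates it would have discarded anyway.
        return scored[:k]
```

For example, `BundledRanker({"a": [1, 0], "b": [0, 1]}, blocked_ids=[]).top_k([1.0, 2.0], ["a", "b"], 1)` scores both candidates locally and returns only the single best one.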

Who Should Read This

Senior Machine Learning Engineers and Principal Engineers focused on optimizing recommendation systems and model inference architectures.

Test Your Knowledge

What are the key limitations of the Two-Tower architecture that prompted the need for re-architecture?

How does bundling features into the model file impact latency and performance?

What trade-offs are involved in moving business logic into the model versus keeping it in the serving system?

What specific optimizations were implemented to achieve a reduction in inference latency from 4000 ms to 20 ms?

How did the shift from local ranking to global ranking affect the distribution of ads served and overall metrics?

Topics

Neural Networks, Machine Learning, Deep Learning, Reinforcement Learning, Transfer Learning

Read Full Article at Pinterest

