Pinterest
10 min read

Beyond Two Towers: Re-architecting the Serving Stack for Next-Gen Ads Lightweight Ranking Models…

Read Full Article

Summary

The article details the transition from a traditional Two-Tower architecture to a more complex GPU-based model inference system for next-generation ad ranking. It highlights the limitations of the Two-Tower model in capturing user-item interactions and the need for a more expressive architecture. The authors describe various optimizations implemented to maintain low latency while integrating heavy model inference, including feature fetching strategies, moving business logic into the model, and optimizing GPU inference speed. The re-architected system aims to enhance recommendation quality significantly while ensuring efficient resource utilization.

Key Learnings

  • 1The Two-Tower architecture, while efficient, lacks the expressiveness needed for complex user-item interactions, necessitating a shift to more sophisticated models.
  • 2Bundling features into the PyTorch model file eliminates network overhead, significantly reducing latency in feature fetching.
  • 3Moving business logic into the model itself allows for parallel processing on the GPU, improving efficiency and reducing unnecessary data transmission.
  • 4Optimizations such as multi-stream CUDA and kernel fusion can drastically reduce inference latency, enabling real-time performance.
  • 5Rethinking data flow and retrieval strategies can lead to substantial improvements in system performance and latency.

Who Should Read This

Senior Machine Learning Engineers and Principal Engineers focused on optimizing recommendation systems and model inference architectures.

Test Your Knowledge

?

What are the key limitations of the Two-Tower architecture that prompted the need for re-architecture?

?

How does bundling features into the model file impact latency and performance?

?

What trade-offs are involved in moving business logic into the model versus keeping it in the serving system?

?

What specific optimizations were implemented to achieve a reduction in inference latency from 4000ms to 20ms?

?

How did the shift from local ranking to global ranking affect the distribution of ads served and overall metrics?

Topics

Read Full Article at Pinterest