Beyond Two Towers: Re-architecting the Serving Stack for Next-Gen Ads Lightweight Ranking Models…
Summary
The article details the transition from a traditional Two-Tower architecture to a more complex GPU-based model inference system for next-generation ad ranking. It highlights the limitations of the Two-Tower model in capturing user-item interactions and the need for a more expressive architecture. The authors describe various optimizations implemented to maintain low latency while integrating heavy model inference, including feature fetching strategies, moving business logic into the model, and optimizing GPU inference speed. The re-architected system aims to enhance recommendation quality significantly while ensuring efficient resource utilization.
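The expressiveness limit mentioned above can be illustrated with a minimal, framework-free sketch of Two-Tower scoring. The embeddings, feature names, and the `user_tower` stand-in are hypothetical and are not taken from the article. The point is only that the user and the ad meet in a single dot product, so no feature that depends on both sides jointly can enter the score.

```python
from typing import Dict, List

def dot(u: List[float], v: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical item embeddings. In a Two-Tower setup the item tower can be
# run offline, so these vectors are precomputed and only looked up at serving time.
ITEM_EMBEDDINGS: Dict[str, List[float]] = {
    "ad_a": [1.0, 0.0],
    "ad_b": [0.0, 1.0],
    "ad_c": [0.5, 0.5],
}

def user_tower(features: Dict[str, float]) -> List[float]:
    """Stand-in for a learned user network (assumption: two toy features)."""
    return [features["likes_sports"], features["likes_cooking"]]

def two_tower_scores(user_features: Dict[str, float]) -> Dict[str, float]:
    # The user embedding is computed once, without looking at any ad.
    u = user_tower(user_features)
    # The only user-item interaction is this dot product.
    return {ad: dot(u, emb) for ad, emb in ITEM_EMBEDDINGS.items()}

scores = two_tower_scores({"likes_sports": 1.0, "likes_cooking": 0.0})
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

This separation is what makes the architecture cheap to serve, and it is also why a more expressive model has to see user and ad features together at inference time.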
Key Learnings
1. The Two-Tower architecture, while efficient, lacks the expressiveness needed for complex user-item interactions, necessitating a shift to more sophisticated models.
2. Bundling features into the PyTorch model file eliminates network overhead, significantly reducing latency in feature fetching.
3. Moving business logic into the model itself allows for parallel processing on the GPU, improving efficiency and reducing unnecessary data transmission.
4. Optimizations such as multi-stream CUDA and kernel fusion can drastically reduce inference latency, enabling real-time performance.
5. Rethinking data flow and retrieval strategies can lead to substantial improvements in system performance and latency.
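Two of the learnings above, bundling features with the model and moving business logic into it, can be sketched together in plain Python. This is an illustration under stated assumptions, not the article's implementation: `BundledRanker`, the blocklist rule, and the toy feature vectors are all hypothetical, and a real system would hold the table in GPU memory and run the filter and top-k as batched tensor operations.

```python
from typing import Dict, List, Set, Tuple

class BundledRanker:
    """Hypothetical ranker that carries its item features and filter rules."""

    def __init__(self, item_features: Dict[str, List[float]], blocked: Set[str]):
        # Features ship with the model object, so scoring needs no network fetch.
        self.item_features = item_features
        # A business rule (assumed here: a blocklist) lives next to the scorer.
        self.blocked = blocked

    def _score(self, user_vec: List[float], item_vec: List[float]) -> float:
        # Toy scoring function standing in for the real model.
        return sum(u * i for u, i in zip(user_vec, item_vec))

    def rank(self, user_vec: List[float], candidate_ids: List[str],
             k: int) -> List[Tuple[str, float]]:
        scored = []
        for item_id in candidate_ids:
            if item_id in self.blocked:
                continue  # filtered inside the model, never sent back to the caller
            feats = self.item_features.get(item_id)
            if feats is None:
                continue  # unknown item: no bundled features to score with
            scored.append((item_id, self._score(user_vec, feats)))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        # Only the top k leave the model, which keeps the response small.
        return scored[:k]

ranker = BundledRanker(
    item_features={"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]},
    blocked={"a"},
)
top = ranker.rank([1.0, 0.0], ["a", "b", "c", "d"], k=1)
print(top)
```

The caller sends only candidate ids and receives only the final top-k, which is the reduction in data transmission that the third learning describes.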
Who Should Read This
Senior Machine Learning Engineers and Principal Engineers focused on optimizing recommendation systems and model inference architectures.
Test Your Knowledge
What are the key limitations of the Two-Tower architecture that prompted the need for re-architecture?
How does bundling features into the model file impact latency and performance?
What trade-offs are involved in moving business logic into the model versus keeping it in the serving system?
What specific optimizations were implemented to achieve a reduction in inference latency from 4000ms to 20ms?
How did the shift from local ranking to global ranking affect the distribution of ads served and overall metrics?