LLM-Powered Relevance Assessment for Pinterest Search
Summary
The article presents a methodology employed by Pinterest Search to enhance search relevance assessment using fine-tuned large language models (LLMs). It addresses the challenges of traditional human annotation methods, which are costly and limited in scale, by leveraging LLMs to automate the labeling process. The authors detail their approach to measuring semantic relevance between queries and Pins, utilizing a cross-encoder model architecture and a stratified sampling design to improve the sensitivity of A/B testing metrics. The results demonstrate significant improvements in evaluation efficiency and cost reduction, allowing for a more robust assessment of search relevance across various queries and languages.
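The link between stratified sampling and A/B test sensitivity can be illustrated with a small calculation. The sketch below is not the article's implementation: the strata (query popularity buckets), the relevance scores, and the significance and power settings are all hypothetical, and the minimum detectable effect (MDE) uses the standard two-sample z-test approximation. It shows only that, under proportional allocation, the within-stratum variance is never larger than the pooled variance, so the MDE shrinks.

```python
from statistics import NormalDist, pvariance


def mde(variance, n_per_arm, alpha=0.05, power=0.8):
    """MDE for a two-sided, two-sample z-test with equal arm sizes."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return (z_alpha + z_beta) * (2 * variance / n_per_arm) ** 0.5


def srs_variance(strata):
    """Per-unit variance when strata are ignored (simple random sampling)."""
    pooled = [score for scores in strata.values() for score in scores]
    return pvariance(pooled)


def stratified_variance(strata):
    """Per-unit variance under proportional allocation:
    the size-weighted average of within-stratum variances."""
    total = sum(len(scores) for scores in strata.values())
    return sum(len(scores) / total * pvariance(scores)
               for scores in strata.values())


# Hypothetical relevance scores grouped by a hypothetical popularity stratum.
toy_strata = {
    "head": [0.90, 0.80, 0.85, 0.95],
    "torso": [0.60, 0.70, 0.65, 0.55],
    "tail": [0.30, 0.40, 0.20, 0.35],
}

n = 1000  # assumed number of sampled queries per experiment arm
print("MDE, simple random sampling:", mde(srs_variance(toy_strata), n))
print("MDE, stratified sampling:   ", mde(stratified_variance(toy_strata), n))
```

The gap between the two MDEs grows with how much the strata means differ, because stratification removes the between-stratum component of the variance.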
Key Learnings
- LLMs can significantly reduce the cost and time associated with human labeling for relevance assessment in search systems.
- A stratified sampling design enables more accurate measurement of heterogeneous treatment effects and reduces the minimum detectable effects (MDEs) in A/B testing.
- Fine-tuning multilingual LLMs enhances their applicability across different languages, although performance may vary based on query popularity.
- The use of a cross-encoder architecture allows for effective representation of Pins in relation to search queries, improving relevance predictions.
- Validation of LLM-generated labels against human annotations shows strong alignment, indicating the reliability of LLMs in this context.
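One way to check LLM labels against human annotations is to compute agreement statistics over a shared set of query-Pin pairs. The sketch below is an assumption about how such a check might look, not the metric the article reports: the 1-to-5 relevance scale, the toy labels, and the choice of exact agreement plus Cohen's kappa are all illustrative.

```python
from collections import Counter


def exact_agreement(human, llm):
    """Fraction of items on which the two label lists match exactly."""
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == m for h, m in zip(human, llm)) / len(human)


def cohens_kappa(human, llm):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    p_observed = exact_agreement(human, llm)
    human_counts = Counter(human)
    llm_counts = Counter(llm)
    # Chance agreement: probability both raters pick the same label
    # if each drew independently from its own label distribution.
    p_expected = sum(human_counts[c] * llm_counts[c] for c in human_counts) / n**2
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels on an assumed 1 (irrelevant) to 5 (highly relevant) scale.
human_labels = [5, 4, 4, 3, 1, 2, 5, 3]
llm_labels = [5, 4, 3, 3, 1, 2, 5, 4]

print("exact agreement:", exact_agreement(human_labels, llm_labels))
print("Cohen's kappa:  ", cohens_kappa(human_labels, llm_labels))
```

Kappa is reported alongside raw agreement because raw agreement can look high when one label dominates. For ordinal relevance scales, a rank correlation or a weighted kappa would also be a reasonable choice.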
Who Should Read This
Senior Machine Learning Engineers implementing LLMs for search relevance optimization
Test Your Knowledge
What are the trade-offs between using LLMs and traditional human labeling for relevance assessment?
How does the choice of sampling design impact the sensitivity of A/B testing metrics?
In what scenarios might the performance of LLMs on non-English queries be inadequate, and how can this be addressed?
What design decisions were made in selecting the features used for Pin representation in relevance prediction?
Why is it important to minimize the minimum detectable effects (MDEs) in the context of online experimentation?