Pinterest
10 min read

LLM-Powered Relevance Assessment for Pinterest Search

Summary

The article describes how Pinterest Search uses fine-tuned large language models (LLMs) to assess search relevance. Human annotation is costly and limited in scale, so the team uses LLMs to automate labeling. The authors measure semantic relevance between queries and Pins with a cross-encoder model and use a stratified sampling design to make A/B testing metrics more sensitive. They report more efficient evaluation at lower cost, which supports relevance assessment across a wider range of queries and languages.
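The link between stratified sampling and A/B test sensitivity can be illustrated with a small calculation. The sketch below is not taken from the article: it assumes proportional allocation, equal arm sizes, a two-sided z-test, and hypothetical strata (the weights, means, and variances are invented for illustration). Under those assumptions, stratification removes the between-stratum part of the variance, which lowers the minimum detectable effect (MDE).

```python
from math import sqrt
from statistics import NormalDist


def mde(variance, n_per_arm, alpha=0.05, power=0.8):
    """Textbook MDE for a two-sample z-test with equal arm sizes."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * sqrt(2 * variance / n_per_arm)


def total_and_within_variance(strata):
    """strata: list of (weight, mean, variance); weights must sum to 1.

    Returns (total variance, within-stratum variance). Simple random
    sampling pays the total variance per sample; proportional stratified
    sampling pays only the within-stratum part.
    """
    overall_mean = sum(w * m for w, m, _ in strata)
    within = sum(w * v for w, _, v in strata)
    between = sum(w * (m - overall_mean) ** 2 for w, m, _ in strata)
    return within + between, within


# Hypothetical strata, e.g. two query segments with different mean relevance.
example_strata = [(0.5, 0.9, 0.01), (0.5, 0.5, 0.01)]
total_var, within_var = total_and_within_variance(example_strata)

mde_simple = mde(total_var, n_per_arm=1000)
mde_stratified = mde(within_var, n_per_arm=1000)
print(f"simple random sampling MDE: {mde_simple:.4f}")
print(f"stratified sampling MDE:    {mde_stratified:.4f}")
```

The further apart the stratum means are, the larger the between-stratum variance and the bigger the gain from stratifying. If all strata have the same mean, the two MDEs coincide.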

Key Learnings

  • LLMs can significantly reduce the cost and time associated with human labeling for relevance assessment in search systems.
  • A stratified sampling design enables more accurate measurement of heterogeneous treatment effects and reduces the minimum detectable effects (MDEs) in A/B testing.
  • Fine-tuning multilingual LLMs enhances their applicability across different languages, although performance may vary based on query popularity.
  • The use of a cross-encoder architecture allows for effective representation of Pins in relation to search queries, improving relevance predictions.
  • Validation of LLM-generated labels against human annotations shows strong alignment, indicating the reliability of LLMs in this context.
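One way to check LLM-generated labels against human annotations is to compute agreement on a shared sample. The sketch below is an illustration only: the summary does not say which alignment metrics the article uses, so exact-match rate and Cohen's kappa are assumed choices here, and the integer relevance labels are hypothetical.

```python
from collections import Counter


def agreement_stats(human_labels, llm_labels):
    """Return (exact-match rate, Cohen's kappa) for two label sequences."""
    if len(human_labels) != len(llm_labels) or not human_labels:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(human_labels)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == m for h, m in zip(human_labels, llm_labels)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labeled independently according to its own label frequencies.
    human_counts = Counter(human_labels)
    llm_counts = Counter(llm_labels)
    expected = sum(human_counts[c] * llm_counts[c] for c in human_counts) / (n * n)

    # Kappa is undefined when chance agreement is 1; treat that case as perfect.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical relevance labels for five (query, Pin) pairs.
human = [1, 2, 3, 3, 2]
llm = [1, 2, 3, 2, 2]
match_rate, kappa = agreement_stats(human, llm)
print(f"exact match: {match_rate:.2f}, kappa: {kappa:.4f}")
```

Kappa corrects the raw match rate for agreement that would occur by chance, which matters when one label dominates the sample. For ordinal relevance scales, a rank-based or weighted measure may be a better fit than plain kappa.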

Who Should Read This

Senior Machine Learning Engineers implementing LLMs for search relevance optimization

Test Your Knowledge

What are the trade-offs between using LLMs and traditional human labeling for relevance assessment?

How does the choice of sampling design impact the sensitivity of A/B testing metrics?

In what scenarios might the performance of LLMs on non-English queries be inadequate, and how can this be addressed?

What design decisions were made in selecting the features used for Pin representation in relevance prediction?

Why is it important to reduce minimum detectable effects (MDEs) in online experimentation?
