Using LLMs to amplify human labeling and improve Dash search relevance
Summary
The article describes how Dropbox Dash, which follows a retrieval-augmented generation (RAG) pattern, improves search relevance by combining large language models (LLMs) with human labeling. Relevance models are trained on a mix of human-labeled data and LLM-generated labels, so the quality of those relevance judgments directly affects search outcomes. The process involves validating LLM labels against human evaluations, optimizing prompts for accuracy, and continuously refining the model based on user behavior and feedback. The article also covers the limits of human labeling, chiefly scalability and consistency, and explains how LLMs can generate relevance labels at scale more cheaply and efficiently.
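As a rough illustration of validating LLM labels against human evaluations, the sketch below compares two label lists for the same (query, document) pairs. The 1-to-5 scale, the choice of exact-match rate and mean squared error as metrics, and the sample values are all assumptions made for this example; the summary does not say which scale or metric the article uses.

```python
from statistics import mean


def agreement_report(human, llm):
    """Compare LLM labels with human labels for the same (query, document) pairs.

    Both arguments are lists of integer relevance grades (assumed 1-5 here).
    """
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    diffs = [h - l for h, l in zip(human, llm)]
    return {
        # Fraction of pairs where the LLM gave exactly the human grade.
        "exact_match": mean(1.0 if d == 0 else 0.0 for d in diffs),
        # Mean squared error penalizes large disagreements more heavily.
        "mse": mean(d * d for d in diffs),
    }


# Hypothetical labels, invented purely for illustration.
human_labels = [5, 3, 1, 4, 2]
llm_labels = [5, 2, 1, 4, 4]
report = agreement_report(human_labels, llm_labels)
print(report)
```

A squared-error metric is a natural fit for graded labels because a miss by two grades counts four times as much as a miss by one, but other agreement measures could be substituted.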
Key Learnings
- Combining human labeling with LLM-generated relevance judgments can significantly scale the training data for search ranking models.
- LLMs must be carefully calibrated and validated against human judgments to ensure the quality of generated relevance labels.
- The relevance evaluation process requires context that may not be present in the query or document, necessitating additional tools for LLMs to research user intent.
- Prompt optimization is critical for improving LLM performance and requires iterative testing and refinement to maintain consistency.
- Human grounding remains essential in the evaluation process to anchor LLM-generated labels and ensure correctness over time.
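The prompt optimization and human grounding points above can be sketched as a selection loop: score each candidate prompt by how far its labels fall from a fixed human-labeled set, then keep the best one. Everything named here is hypothetical, including `toy_labeler` (a word-overlap stand-in for a real LLM call), the two prompt strings, and the three examples; it is a minimal sketch of the idea, not the procedure the article describes.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str, int]            # (query, document, human_label)
Labeler = Callable[[str, str, str], int]  # (prompt, query, document) -> label


def mse_for_prompt(prompt: str, examples: List[Example], labeler: Labeler) -> float:
    """Mean squared error of a prompt's labels against the human labels."""
    errors = [(labeler(prompt, q, d) - y) ** 2 for q, d, y in examples]
    return sum(errors) / len(errors)


def pick_best_prompt(
    prompts: List[str], examples: List[Example], labeler: Labeler
) -> Tuple[str, Dict[str, float]]:
    """Return the prompt with the lowest error, plus every prompt's score."""
    scores = {p: mse_for_prompt(p, examples, labeler) for p in prompts}
    best = min(scores, key=scores.get)
    return best, scores


def toy_labeler(prompt: str, query: str, document: str) -> int:
    """Hypothetical stand-in for an LLM call: grades by shared words."""
    doc_words = set(document.lower().split())
    overlap = sum(1 for w in query.lower().split() if w in doc_words)
    if "graded" in prompt:
        return min(5, 1 + 2 * overlap)  # graded 1-5 answer
    return 5 if overlap else 1          # binary answer mapped to 1 or 5


PROMPTS = [
    "Rate relevance on a graded 1-5 scale.",
    "Answer relevant or not relevant.",
]
# Human-labeled anchor set, invented for illustration.
EXAMPLES = [
    ("q3 roadmap", "q3 roadmap draft", 5),
    ("q3 roadmap", "roadmap archive 2019", 3),
    ("q3 roadmap", "lunch menu", 1),
]

best, scores = pick_best_prompt(PROMPTS, EXAMPLES, toy_labeler)
print(best, scores)
```

Keeping the human-labeled set fixed across iterations is what makes scores comparable between prompt versions; re-running the same check over time is one way to notice when LLM labels drift away from human judgment.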
Who Should Read This
Senior Machine Learning Engineers implementing AI-driven search solutions in enterprise environments.
Test Your Knowledge
What are the trade-offs between using human labeling and LLM-generated relevance judgments in training search models?
How does the retrieval-augmented generation (RAG) pattern enhance the performance of enterprise search systems?
What specific challenges arise when using LLMs for relevance evaluation, and how can they be mitigated?
In what ways can user behavior inform the generation of relevance labels, and what limitations does it have?
How does the process of prompt optimization affect the accuracy of LLM-generated relevance judgments?