Dropbox • 11 min read • October 22, 2025

Half-Quadratic Quantization of large machine learning models

Summary

The article introduces Half-Quadratic Quantization (HQQ), a quantization technique for deploying large language models (LLMs) on consumer-grade hardware. HQQ reduces memory requirements while preserving model accuracy, and it quantizes models much faster than calibration-based methods such as GPTQ and AWQ. Because it needs no calibration data, it can quantize large models such as Llama-2-70B in under five minutes. The article also covers the mathematical foundations of HQQ, including a sparsity-promoting loss function and the optimization techniques that keep the method efficient on large-scale models.
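To put the memory claim in perspective, here is a rough back-of-the-envelope estimate of weight storage at different bit widths. It is only a sketch: the helper name `estimate_weight_memory_gb`, the nominal 70 billion parameter count, the group size of 64, and the 16-bit scale and zero-point per group are all illustrative assumptions, not figures from the article. It also ignores activations, the KV cache, and any layers left unquantized.

```python
def estimate_weight_memory_gb(n_params, bits_per_weight, group_size=None, meta_bits=16):
    """Approximate weight storage in gigabytes (10^9 bytes).

    Hypothetical helper for illustration. When group_size is given, each
    group of weights is assumed to carry one scale and one zero-point,
    each stored in meta_bits bits.
    """
    bits = float(bits_per_weight)
    if group_size is not None:
        # Amortize the per-group scale and zero-point over the group.
        bits += 2 * meta_bits / group_size
    return n_params * bits / 8 / 1e9


n_params = 70e9  # nominal parameter count, assumed for illustration

fp16_gb = estimate_weight_memory_gb(n_params, 16)
int4_gb = estimate_weight_memory_gb(n_params, 4, group_size=64)

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights, group size 64: {int4_gb:.1f} GB")
```

Under these assumptions the per-group metadata adds half a bit per weight, which is why smaller groups trade some memory for accuracy.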

Key Learnings

1. HQQ quantizes large models more than 50x faster than calibration-based methods.
2. The method does not require calibration data, which avoids the data bias and computational overhead that calibration introduces.
3. A sparsity-promoting loss function handles outlier weights better, improving quantization quality.
4. HQQ's optimization formulation has closed-form solutions, so the solver can run in inference mode without gradient descent.
5. The technique performs competitively with calibration-based methods, making it suitable for deployment in resource-constrained environments.
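The sparsity-promoting loss and the closed-form updates can be sketched for a single group of weights. This is a minimal pure-Python illustration, not the library's implementation: the function names, the choice to keep the scale fixed and update only the zero-point, and the defaults (`p=0.7`, `beta=10.0`, `kappa=1.01`, `iters=20`) are assumptions made for the example. Each iteration alternates two closed-form steps: a shrinkage operator applied to the quantization error (the proximal step for an l_p penalty with p < 1), and a mean-based update of the zero-point. No gradients are computed.

```python
import math


def shrink_lp(x, beta, p=0.7):
    """Generalized soft-thresholding for an l_p penalty with p < 1."""
    if x == 0.0:
        return 0.0  # avoid raising 0 to a negative power
    magnitude = abs(x) - abs(x) ** (p - 1) / beta
    return math.copysign(max(magnitude, 0.0), x)


def quantize(weights, scale, zero, q_max):
    """Map floats to integer codes in [0, q_max]."""
    return [min(max(round(w / scale + zero), 0), q_max) for w in weights]


def dequantize(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [scale * (q - zero) for q in codes]


def mean_abs_error(weights, scale, zero, q_max):
    """Mean absolute error after a quantize/dequantize round trip."""
    recon = dequantize(quantize(weights, scale, zero, q_max), scale, zero)
    return sum(abs(w - r) for w, r in zip(weights, recon)) / len(weights)


def hqq_style_fit(weights, n_bits=4, p=0.7, beta=10.0, kappa=1.01, iters=20):
    """Refine the zero-point of one weight group (hypothetical helper)."""
    q_max = 2 ** n_bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max or 1.0  # fall back to 1.0 for constant groups
    zero = -w_min / scale                   # round-to-nearest starting point
    best_zero = zero
    best_err = mean_abs_error(weights, scale, zero, q_max)
    for _ in range(iters):
        codes = quantize(weights, scale, zero, q_max)
        recon = dequantize(codes, scale, zero)
        # Step 1: sparse error estimate via the shrinkage operator.
        w_e = [shrink_lp(w - r, beta, p) for w, r in zip(weights, recon)]
        # Step 2: closed-form zero-point update (a plain average).
        zero = sum(q - (w - e) / scale
                   for q, w, e in zip(codes, weights, w_e)) / len(weights)
        beta *= kappa
        err = mean_abs_error(weights, scale, zero, q_max)
        if err < best_err:  # keep the best zero-point seen so far
            best_zero, best_err = zero, err
    return scale, best_zero, quantize(weights, scale, best_zero, q_max)


# Example group with one outlier weight (values are made up).
weights = [0.12, -0.40, 0.33, 0.05, -0.07, 2.50, -0.21, 0.18]
scale, zero, codes = hqq_style_fit(weights)
print(codes, mean_abs_error(weights, scale, zero, 15))
```

Because the sketch keeps the best zero-point it has seen, the result is never worse than the round-to-nearest starting point on mean absolute error. A practical implementation would run the same steps on whole tensors on a GPU rather than on Python lists.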

Who Should Read This

Senior Machine Learning Engineers implementing efficient model deployment strategies for large language models.

Test Your Knowledge

What are the main advantages of using Half-Quadratic Quantization over calibration-based methods?

How does the choice of loss function impact the performance of quantization techniques?

What challenges does model quantization address in the context of deploying large language models?

In what scenarios might the lack of calibration data be beneficial for quantization?

How does the optimization formulation of HQQ differ from traditional gradient descent methods?

Topics

Large Language Models, Quantization, Machine Learning, Deep Learning, Generative AI
