Engineering posts about Transformer
Curated summaries and key learnings for engineers working with Transformer models.
Enhancing Ad Relevance: Integrating Real-Time Context into Sequential Recommender Models
The article presents a novel approach to enhancing ad relevance by integrating real-time context into sequential recommender models. It highlights the limitations of previous models that relied...
Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
The article discusses a novel approach to Key-Value (KV) caching in transformer language models, focusing on reducing memory footprint while maintaining high throughput during autoregressive...
Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding
The article discusses advancements in Large Language Model (LLM) inference acceleration through the implementation of block diffusion speculative decoding, specifically the DFlash method, on Google...
Modernizing the Facebook Groups Search to Unlock the Power of Community Knowledge
The article outlines a significant overhaul of Facebook Groups Search, transitioning from traditional keyword-based retrieval to a hybrid architecture that incorporates both lexical precision and...
Production-Ready AI Agents: 5 Lessons from Refactoring a Monolith
The article outlines the challenges of developing production-ready AI agents, particularly focusing on the transition from monolithic architectures to orchestrated sub-agents. It details a case study...
Scaling Recommendation Systems with Request-Level Deduplication
The article discusses the implementation of request-level deduplication in scaling recommendation systems at Pinterest, highlighting its significant impact on storage, training, and serving...
Advanced Prompt Caching at Scale
The article explores the complexities of prompt caching in large-scale AI inference systems, highlighting the challenges of maintaining cache efficiency across multiple replicas. It details how...
Thinking into the Future: Latent Lookahead Training for Transformers
The article presents a novel training strategy called latent lookahead for autoregressive language models, aimed at enhancing their predictive capabilities. Traditional next-token prediction limits...
Exclusive Self Attention
The article presents exclusive self-attention (XSA), a modification of traditional self-attention (SA) that enhances the performance of Transformers in sequence modeling tasks. By constraining...
TrajTok: Learning Trajectory Tokens enables better Video Understanding
The article presents TrajTok, an innovative video tokenizer designed to enhance video understanding by dynamically adapting token granularity based on semantic complexity. Unlike traditional methods...
Recommending Travel Destinations to Help Users Explore
The article outlines the development of a destination recommendation model aimed at assisting users during the exploratory phase of trip planning. It highlights the unique challenges of integrating...
A Small-Scale System for Autoregressive Program Synthesis Enabling Controlled Experimentation
This article presents a novel system named Cadmus for autoregressive program synthesis, designed to facilitate controlled experimentation in machine learning. The system utilizes an integer virtual...
How low-bit inference enables efficient AI
The article discusses the advancements in large machine learning models and the challenges associated with their deployment, particularly focusing on low-bit inference techniques that enhance...
Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization
The article presents the Parallel Track (PT) Transformer, a new architecture designed to enhance the efficiency of large language model (LLM) inference on GPUs. By restructuring computation to...
Accelerating Drug Discovery: From FASTA Files to GenAI Insights on Databricks
The article discusses an approach to accelerating drug discovery by processing biological data with Databricks' Lakeflow Declarative Pipelines. It outlines a comprehensive workflow that...
Ads Candidate Generation using Behavioral Sequence Modeling
The article outlines Pinterest's innovative approach to enhancing ad candidate generation through behavioral sequence modeling. By leveraging a transformer-based model, the team predicts user...
Exai Bio & Databricks: Accelerating AI-Powered Liquid Biopsy for Early Cancer Detection
The article highlights the collaboration between Exai Bio and Databricks to enhance early cancer detection through generative AI models, Exai-1 and Orion. These models utilize advanced techniques...
Pretraining with Hierarchical Memories: Separating Long-Tail and Common Knowledge
The article presents a novel approach to enhancing the performance of language models by integrating hierarchical memory architectures. This method allows smaller models to access larger memory...
Sharp Monocular View Synthesis in Less Than a Second
The article presents SHARP, a method for photorealistic view synthesis from a single image that runs in under a second on standard GPUs. SHARP regresses the parameters...
Universal User Modeling (UUM): A Foundation Model for User Understanding at Snapchat
The article discusses Universal User Modeling (UUM) at Snapchat, a foundation model designed to enhance user understanding across various product surfaces. UUM captures user behaviors over time by...