Engineering posts about Transformer

Curated summaries and key learnings for engineers working with Transformers.

Pinterest
6m

Enhancing Ad Relevance: Integrating Real-Time Context into Sequential Recommender Models

The article presents a novel approach to enhancing ad relevance by integrating real-time context into sequential recommender models. It highlights the limitations of previous models that relied...

Apple
3m

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

The article discusses a novel approach to Key-Value (KV) caching in transformer language models, focusing on reducing memory footprint while maintaining high throughput during autoregressive...

Google
12m

Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding

The article discusses advancements in Large Language Model (LLM) inference acceleration through the implementation of block diffusion speculative decoding, specifically the DFlash method, on Google...

Meta (Facebook)
6m

Modernizing the Facebook Groups Search to Unlock the Power of Community Knowledge

The article outlines a significant overhaul of Facebook Groups Search, transitioning from traditional keyword-based retrieval to a hybrid architecture that incorporates both lexical precision and...

Google
5m

Production-Ready AI Agents: 5 Lessons from Refactoring a Monolith

The article outlines the challenges of developing production-ready AI agents, particularly focusing on the transition from monolithic architectures to orchestrated sub-agents. It details a case study...

Pinterest
9m

Scaling Recommendation Systems with Request-Level Deduplication

The article discusses how Pinterest uses request-level deduplication to scale its recommendation systems, highlighting its significant impact on storage, training, and serving...

DigitalOcean
8m

Advanced Prompt Caching at Scale

The article explores the complexities of prompt caching in large-scale AI inference systems, highlighting the challenges of maintaining cache efficiency across multiple replicas. It details how...

Apple
3m

Thinking into the Future: Latent Lookahead Training for Transformers

The article presents a novel training strategy called latent lookahead for autoregressive language models, aimed at enhancing their predictive capabilities. Traditional next-token prediction limits...

Apple
2m

Exclusive Self Attention

The article presents exclusive self-attention (XSA), a modification of traditional self-attention (SA) that enhances the performance of Transformers in sequence modeling tasks. By constraining...

Apple
3m

TrajTok: Learning Trajectory Tokens enables better Video Understanding

The article presents TrajTok, an innovative video tokenizer designed to enhance video understanding by dynamically adapting token granularity based on semantic complexity. Unlike traditional methods...

Airbnb
6m

Recommending Travel Destinations to Help Users Explore

The article outlines the development of a destination recommendation model aimed at assisting users during the exploratory phase of trip planning. It highlights the unique challenges of integrating...

Apple
3m

A Small-Scale System for Autoregressive Program Synthesis Enabling Controlled Experimentation

This article presents a novel system named Cadmus for autoregressive program synthesis, designed to facilitate controlled experimentation in machine learning. The system utilizes an integer virtual...

Dropbox
14m

How low-bit inference enables efficient AI

The article discusses the advancements in large machine learning models and the challenges associated with their deployment, particularly focusing on low-bit inference techniques that enhance...

Apple
3m

Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization

The article presents the Parallel Track (PT) Transformer, a new architecture designed to enhance the efficiency of large language model (LLM) inference on GPUs. By restructuring computation to...

Databricks
6m

Accelerating Drug Discovery: From FASTA Files to GenAI Insights on Databricks

The article discusses an innovative approach to accelerate drug discovery by processing biological data using Databricks' Lakeflow Declarative Pipelines. It outlines a comprehensive workflow that...

Pinterest
10m

Ads Candidate Generation using Behavioral Sequence Modeling

The article outlines Pinterest's innovative approach to enhancing ad candidate generation through behavioral sequence modeling. By leveraging a transformer-based model, the team predicts user...

Databricks
8m

Exai Bio & Databricks: Accelerating AI-Powered Liquid Biopsy for Early Cancer Detection

The article highlights the collaboration between Exai Bio and Databricks to enhance early cancer detection through generative AI models, Exai-1 and Orion. These models utilize advanced techniques...

Apple
3m

Pretraining with Hierarchical Memories: Separating Long-Tail and Common Knowledge

The article presents a novel approach to enhancing the performance of language models by integrating hierarchical memory architectures. This method allows smaller models to access larger memory...

Apple
3m

Sharp Monocular View Synthesis in Less Than a Second

The article presents SHARP, a novel method for photorealistic view synthesis from a single image, achieving remarkable performance in under a second on standard GPUs. SHARP regresses the parameters...

Snap (Snapchat)
6m

Universal User Modeling (UUM): A Foundation Model for User Understanding at Snapchat

The article discusses Universal User Modeling (UUM) at Snapchat, a foundation model designed to enhance user understanding across various product surfaces. UUM captures user behaviors over time by...