Engineering posts about Transformer

Curated summaries and key learnings for engineers working with Transformer models.

Enhancing Ad Relevance: Integrating Real-Time Context into Sequential Recommender Models

The article presents a novel approach to enhancing ad relevance by integrating real-time context into sequential recommender models. It highlights the limitations of previous models that relied...

Apple

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

The article discusses a novel approach to Key-Value (KV) caching in transformer language models, focusing on reducing memory footprint while maintaining high throughput during autoregressive...

Google

Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding

The article discusses advancements in Large Language Model (LLM) inference acceleration through the implementation of block diffusion speculative decoding, specifically the DFlash method, on Google...

Meta (Facebook)

Modernizing the Facebook Groups Search to Unlock the Power of Community Knowledge

The article outlines a significant overhaul of Facebook Groups Search, transitioning from traditional keyword-based retrieval to a hybrid architecture that incorporates both lexical precision and...

Google

Production-Ready AI Agents: 5 Lessons from Refactoring a Monolith

The article outlines the challenges of developing production-ready AI agents, particularly focusing on the transition from monolithic architectures to orchestrated sub-agents. It details a case study...

Scaling Recommendation Systems with Request-Level Deduplication

The article discusses the implementation of request-level deduplication in scaling recommendation systems at Pinterest, highlighting its significant impact on storage, training, and serving...

DigitalOcean

Advanced Prompt Caching at Scale

The article explores the complexities of prompt caching in large-scale AI inference systems, highlighting the challenges of maintaining cache efficiency across multiple replicas. It details how...

Apple

Thinking into the Future: Latent Lookahead Training for Transformers

The article presents a novel training strategy called latent lookahead for autoregressive language models, aimed at enhancing their predictive capabilities. Traditional next-token prediction limits...

Apple

Exclusive Self Attention

The article presents exclusive self-attention (XSA), a modification of traditional self-attention (SA) that enhances the performance of Transformers in sequence modeling tasks. By constraining...

Apple

TrajTok: Learning Trajectory Tokens enables better Video Understanding

The article presents TrajTok, an innovative video tokenizer designed to enhance video understanding by dynamically adapting token granularity based on semantic complexity. Unlike traditional methods...

Airbnb

Recommending Travel Destinations to Help Users Explore

The article outlines the development of a destination recommendation model aimed at assisting users during the exploratory phase of trip planning. It highlights the unique challenges of integrating...

Apple

A Small-Scale System for Autoregressive Program Synthesis Enabling Controlled Experimentation

This article presents a novel system named Cadmus for autoregressive program synthesis, designed to facilitate controlled experimentation in machine learning. The system utilizes an integer virtual...

Dropbox

How low-bit inference enables efficient AI

Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization

Accelerating Drug Discovery: From FASTA Files to GenAI Insights on Databricks

Ads Candidate Generation using Behavioral Sequence Modeling

Exai Bio & Databricks: Accelerating AI-Powered Liquid Biopsy for Early Cancer Detection

Pretraining with Hierarchical Memories: Separating Long-Tail and Common Knowledge

Sharp Monocular View Synthesis in Less Than a Second

Universal User Modeling (UUM): A Foundation Model for User Understanding at Snapchat