Engineering posts about Computer Vision
Curated summaries and key learnings for engineers working with Computer Vision.
Text-Conditional JEPA for Learning Semantically Rich Visual Representations
The article introduces Text-Conditional JEPA (TC-JEPA), a new framework for learning semantically rich visual representations by leveraging image captions to modulate predicted features. This...
What Matters in Practical Learned Image Compression
The article presents a comprehensive study on learned image compression codecs, emphasizing their optimization for the human visual system. It highlights the development of a new codec that...
From Where Things Are to What They’re For: Benchmarking Spatial–Functional Intelligence for Multimodal LLMs
The paper introduces the Spatial-Functional Intelligence Benchmark (SFI-Bench), aimed at evaluating the advanced reasoning capabilities of multimodal large language models (MLLMs). It highlights the...
Normalizing Flows with Iterative Denoising
The article presents advancements in Normalizing Flows (NFs) through the introduction of iterative TARFlow (iTARFlow), a generative model that combines autoregressive generation with iterative...
AI Applications: Tools, Use Cases, and Platforms
This guide serves as a practical resource for data leaders and engineers, detailing the landscape of AI applications across various industries. It explores the evolution of AI tools, particularly...
Bootstrapping Sign Language Annotations with Sign Language Models
The article presents a novel approach to enhance sign language annotation through machine learning techniques. It outlines the limitations of current datasets and introduces a pseudo-annotation...
STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows
The article introduces STARFlow-V, a novel video generative model that leverages normalizing flows for end-to-end likelihood-based generation. Unlike conventional diffusion-based models, STARFlow-V...
Learning Long-Term Motion Embeddings for Efficient Kinematics Generation
The article presents a novel method for learning long-term motion embeddings aimed at enhancing the efficiency of kinematics generation. By leveraging a highly compressed motion embedding with a...
What Do Your Logits Know? (The Answer May Surprise You!)
The article investigates the implications of probing model internals in vision-language models, revealing how different representational levels can leak information. It highlights the risks...
Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting
The article introduces LGTM (Less Gaussians, Texture More), a novel feed-forward framework designed to enhance the scalability of 3D Gaussian Splatting methods for high-resolution image synthesis,...
Drop-In Perceptual Optimization for 3D Gaussian Splatting
The article presents a novel approach to optimizing 3D Gaussian Splatting (3DGS) through perceptual optimization strategies. It highlights the shortcomings of traditional pixel-level loss functions...
SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation
The article presents SafetyPairs, a framework designed to isolate safety-critical features in images through counterfactual image generation. This approach addresses the challenge of identifying...
Jump to play: Building with Gemini & MediaPipe
The article explores the integration of Gemini and MediaPipe for developing interactive applications and games that utilize real-time input control. MediaPipe offers a suite of machine learning...
AMES: Approximate Multi-modal Enterprise Search via Late Interaction Retrieval
The article presents AMES (Approximate Multi-modal Enterprise Search), a unified architecture for late interaction retrieval that integrates text, image, and video modalities into a shared...
TrajTok: Learning Trajectory Tokens enables better Video Understanding
The article presents TrajTok, an innovative video tokenizer designed to enhance video understanding by dynamically adapting token granularity based on semantic complexity. Unlike traditional methods...
RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
The article introduces RubiCap, a novel framework that leverages rubric-guided reinforcement learning to enhance dense image captioning. It addresses the challenges of generating high-quality...
LiTo: Surface Light Field Tokenization
The article presents LiTo (Surface Light Field Tokenization), a 3D latent representation designed to jointly model object geometry and view-dependent appearance....
Flow Matching with Semidiscrete Couplings
The article presents a novel approach to flow matching using semidiscrete couplings, addressing limitations in traditional optimal transport methods. It highlights the inefficiencies of the OT flow...
Multi-Frequency Fusion for Robust Video Face Forgery Detection
The article presents a novel approach to video face forgery detection through a method termed Multi-Frequency Fusion. This technique utilizes a lightweight fusion of two handcrafted cues,...
A.R.I.S.: Automated Recycling Identification System for E-Waste Classification Using Deep Learning
A.R.I.S. (Automated Recycling Identification System) is an approach to e-waste classification that uses deep learning to improve material recovery from electronic waste. By...