Engineering posts about Computer Vision

Curated summaries and key learnings for engineers working with Computer Vision.

Text-Conditional JEPA for Learning Semantically Rich Visual Representations

The article introduces Text-Conditional JEPA (TC-JEPA), a new framework for learning semantically rich visual representations by leveraging image captions to modulate predicted features. This...

Apple

What Matters in Practical Learned Image Compression

The article presents a comprehensive study on learned image compression codecs, emphasizing their optimization for the human visual system. It highlights the development of a new codec that...

Apple

From Where Things Are to What They’re For: Benchmarking Spatial–Functional Intelligence for Multimodal LLMs

The paper introduces the Spatial-Functional Intelligence Benchmark (SFI-Bench), aimed at evaluating the advanced reasoning capabilities of multimodal large language models (MLLMs). It highlights the...

Apple

Normalizing Flows with Iterative Denoising

The article presents advancements in Normalizing Flows (NFs) through the introduction of iterative TARFlow (iTARFlow), a generative model that combines autoregressive generation with iterative...

Databricks

AI Applications: Tools, Use Cases, and Platforms

This guide serves as a practical resource for data leaders and engineers, detailing the landscape of AI applications across various industries. It explores the evolution of AI tools, particularly...

Apple

Bootstrapping Sign Language Annotations with Sign Language Models

The article presents a novel approach to enhance sign language annotation through machine learning techniques. It outlines the limitations of current datasets and introduces a pseudo-annotation...

Apple

STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows

The article introduces STARFlow-V, a novel video generative model that leverages normalizing flows for end-to-end likelihood-based generation. Unlike conventional diffusion-based models, STARFlow-V...

Apple

Learning Long-Term Motion Embeddings for Efficient Kinematics Generation

The article presents a novel method for learning long-term motion embeddings aimed at enhancing the efficiency of kinematics generation. By leveraging a highly compressed motion embedding with a...

Apple

What Do Your Logits Know? (The Answer May Surprise You!)

The article investigates the implications of probing model internals in vision-language models, revealing how different representational levels can leak information. It highlights the risks...

Apple

Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting

The article introduces LGTM (Less Gaussians, Texture More), a novel feed-forward framework designed to enhance the scalability of 3D Gaussian Splatting methods for high-resolution image synthesis,...

Apple

Drop-In Perceptual Optimization for 3D Gaussian Splatting

The article presents a novel approach to optimizing 3D Gaussian Splatting (3DGS) through perceptual optimization strategies. It highlights the shortcomings of traditional pixel-level loss functions...

Apple

SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation

The article presents SafetyPairs, a framework designed to isolate safety-critical features in images through counterfactual image generation. This approach addresses the challenge of identifying...

Google

Jump to play: Building with Gemini & MediaPipe

The article explores the integration of Gemini and MediaPipe for developing interactive applications and games that utilize real-time input control. MediaPipe offers a suite of machine learning...

Apple

AMES: Approximate Multi-modal Enterprise Search via Late Interaction Retrieval

The article presents AMES (Approximate Multi-modal Enterprise Search), a unified architecture for late interaction retrieval that integrates text, image, and video modalities into a shared...

Apple

TrajTok: Learning Trajectory Tokens enables better Video Understanding

The article presents TrajTok, an innovative video tokenizer designed to enhance video understanding by dynamically adapting token granularity based on semantic complexity. Unlike traditional methods...

Apple

RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning

The article introduces RubiCap, a novel framework that leverages rubric-guided reinforcement learning to enhance dense image captioning. It addresses the challenges of generating high-quality...

Apple

LiTo: Surface Light Field Tokenization

The article presents LiTo (Surface Light Field Tokenization), which introduces a 3D latent representation designed to jointly model object geometry and view-dependent appearance....

Apple

Flow Matching with Semidiscrete Couplings

The article presents a novel approach to flow matching using semidiscrete couplings, addressing limitations in traditional optimal transport methods. It highlights the inefficiencies of the OT flow...

Apple

Multi-Frequency Fusion for Robust Video Face Forgery Detection

The article presents a novel approach to video face forgery detection through a method termed Multi-Frequency Fusion. This technique utilizes a lightweight fusion of two handcrafted cues,...

Apple

A.R.I.S.: Automated Recycling Identification System for E-Waste Classification Using Deep Learning

A.R.I.S. (Automated Recycling Identification System) is a novel approach to e-waste classification that leverages deep learning techniques to enhance material recovery from electronic waste. By...