Search papers, labs, and topics across Lattice.
100 papers published across 7 labs.
Achieve high-precision multi-robot SLAM with minimal data transmission by selectively compressing and transmitting keyframes and non-keyframes in a cloud-edge-robot architecture.
LMMs can slash FLOPs by 89% without sacrificing accuracy, thanks to a frequency-modulated visual restoration technique that preserves crucial visual semantics even with fewer tokens.
Tactile robotic perception gets a boost with a new pretraining method that explicitly encodes force, geometry, and orientation, leading to a 52% reduction in regression error.
Ditch the pre-trained models: TacLoc achieves accurate robotic pose estimation from tactile sensing alone by framing it as a one-shot point cloud registration problem.
Achieve near-perfect audio steganography even under heavy MP3 compression by optimizing latent reconstruction and diffusion inversion errors.
Forget paired video-music training data: V2M-Zero aligns video and music by matching the *timing* of changes within each modality, not the content itself.
Self-supervised learning can dramatically improve online HD map construction, outperforming supervised methods even with limited labeled data by leveraging geospatial consistency in BEV feature representations.
VLA-controlled robots can now detect anomalies in under 100ms using a plug-and-play module, enabling real-time recovery from unexpected situations.
Automating museum video metadata curation is now possible with a locally deployable video language model, unlocking previously inaccessible audiovisual archives.
Stop wrestling with unstable action spaces: ResWM reframes visual RL by predicting incremental action adjustments, leading to smoother control and better performance.
Unlock bimanual-level cloth manipulation with a single robotic arm using a novel tactile gripper and vision-based perception framework.
Achieve real-time photorealistic image enhancement without sacrificing visual quality or semantic consistency, thanks to a novel hybrid training strategy for GANs.
Ditch the clunky controllers: this hand-shadowing pipeline lets you teleoperate a robot arm with just an RGB-D camera and some clever inverse kinematics.
Diffusion Transformers can be accelerated by up to 7x with nearly lossless performance using a training-free method that selectively computes on sparse anchor tokens, outperforming existing temporal acceleration techniques.
LVLMs can now provide depth-aware pedestrian navigation guidance by grounding language reasoning and segmentation, without needing user-provided cues or anchor points.
Robots lost in the vineyard? Not anymore: encoding row-level semantics into a particle filter enables robust localization in repetitive agricultural environments where LiDAR and vision alone fail.
Forget training on massive datasets: this new diffusion policy learns human-like 3D scanning strategies that generalize to unseen objects while being robust to noise.
Ditch the heuristic latent spaces: Geometric Autoencoders offer a principled way to inject VFM priors into diffusion models, yielding state-of-the-art image generation with better compression and semantic depth.
Even in feature-rich environments, LiDAR SLAM systems are vulnerable to a new spoofing attack (D-SLAMSpoof) that injects dynamically coordinated spurious point clouds, but can be defended against using inertial dead reckoning.
Robots can now adaptively decide whether to clear clutter or directly grasp, leading to significantly improved success rates in densely cluttered environments.
By adaptively weighting neighbor information based on uncertainty, distributed multi-object tracking can achieve significantly better performance in mobile robot networks with heterogeneous localization quality.
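One simple instance of uncertainty-weighted neighbor fusion is inverse-variance weighting, sketched below in NumPy (the function name and numbers are illustrative, not the paper's algorithm):

```python
import numpy as np

def fuse_estimates(estimates, variances):
    # Inverse-variance weighting: neighbors with poorer localization
    # (higher variance) contribute less to the fused estimate.
    w = 1.0 / np.asarray(variances, dtype=float)
    w /= w.sum()
    return w @ np.asarray(estimates, dtype=float)

# Three neighbors report a target's x-position with varying localization quality;
# the noisy third neighbor barely shifts the fused estimate.
fused = fuse_estimates([10.0, 10.4, 12.0], [0.1, 0.2, 2.0])
```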
This new OCR model beats Gemini-3.1-Pro and Qwen3-VL-235B on key information extraction, thanks to its clever "Layout-as-Thought" process that recovers layout grounding in end-to-end OCR.
Achieve 2.5x higher success in UAV navigation by decoupling target generation from progress monitoring, enabling safer and more efficient zero-shot flight.
Forget fine-tuning: surprisingly, single neuron activations in VLMs can be directly probed to create classifiers that outperform the full model, with 5x speedups.
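A single-neuron probe can be as simple as a 1-D threshold classifier over one activation; a hypothetical sketch (not the paper's exact procedure):

```python
import numpy as np

def neuron_probe(acts, labels):
    # Search over thresholds (and direction) for the split of one neuron's
    # activations that best predicts the binary label.
    best_acc, best_t, best_sign = 0.0, None, 1
    for t in np.unique(acts):
        for sign in (1, -1):
            acc = np.mean((sign * acts >= sign * t) == labels)
            if acc > best_acc:
                best_acc, best_t, best_sign = acc, t, sign
    return best_acc, best_t, best_sign

acts = np.array([0.1, 0.2, 0.9, 1.1])          # one neuron's activations
labels = np.array([False, False, True, True])  # e.g. "is this class X?"
acc, t, sign = neuron_probe(acts, labels)      # a perfect split exists here
```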
Generative AI's ability to reason about and refine images based on authenticity criteria inadvertently creates a powerful evasion strategy that renders current deepfake detectors ineffective.
Jointly training layered Gaussian splats boosts reconstruction quality by up to 2.6 dB, proving that coordinating optimization across layers is key for progressive 2D Gaussian splatting.
Monocular depth estimation can now run at 161 FPS on edge devices without sacrificing too much accuracy, thanks to a clever asynchronous architecture that reuses features from a foundation model.
A training-free visual distillation method boosts VLA model performance in cluttered environments by over 34%, proving that targeted noise reduction is more effective than brute-force scaling.
Imagine an XR experience where you can selectively isolate and enhance individual sound sources in real-time, making chaotic audio environments crystal clear.
Ditch the slow diffusion grind: Marigold-SSD delivers zero-shot depth completion in a single step, rivaling discriminative models in speed while retaining diffusion's accuracy.
Backdoor triggers in ViTs leave a surprisingly clear signature: a linear direction in activation space that can be directly manipulated to activate or deactivate the backdoor.
Multimodal LLMs still struggle to faithfully recreate webpages from videos, particularly in capturing fine-grained style and motion, despite advances in other areas.
Bypass the need for extensive on-site data collection when deploying pre-trained robot models by visually prompting them to adapt to new scenes.
Autonomous vehicles can now better "see" the world even when cameras fail, thanks to a new method that fills in the blanks by leveraging spatial overlaps and learned semantic priors.
Skip expensive manual annotation: this method extracts accurate 3D UAV trajectories and classifications directly from readily available internet videos.
Generate realistic and controllable videos of humans interacting with objects using only sparse motion cues, like wrist positions and object bounding boxes.
By converting point clouds into a format VLMs can understand, VLM-Loc significantly boosts text-to-point-cloud localization accuracy, outperforming prior methods that rely on shallower text-point cloud correspondences.
Disagreement between pathologists, quantified as "Whole Slide Difficulty," can be leveraged to significantly boost the accuracy of AI Gleason grading, particularly for challenging cases.
Sports expose surprising limitations in VLMs' spatial reasoning, as current models struggle to generalize from existing benchmarks despite fine-tuning gains on a new, large-scale dataset.
Fine-grained foot motion capture, a notoriously hard problem, gets a 30% accuracy boost by cleverly lifting 2D keypoints to 3D using motion capture data and contextual information, bypassing the need for direct image-3D annotation pairs.
Even the most advanced MLLMs like GPT-5 and Gemini struggle to spot the "odd one out" in simple visual grids, revealing a surprising weakness in fine-grained visual perception.
Forget manual labeling: STONE offers a massive, automatically-labeled dataset for off-road robot navigation, unlocking scalable training for robust 3D traversability prediction.
Generative drifting's empirical success is no longer a mystery: it's secretly score matching, but with frequency-dependent convergence bottlenecks that explain the preference for Laplacian kernels.
Achieve SOTA multi-modal object tracking by adaptively fusing modalities with a Mixture of Experts and decoupling temporal propagation with separate State Space Models.
Predictive Spectral Calibration boosts source-free test-time adaptation for image regression: by aligning target features within the source predictive support and calibrating residual spectral slack, it delivers significant gains under distribution shift.
By explicitly bridging the gap between on-body appearances and flat layouts, BridgeDiff achieves state-of-the-art virtual try-off results, generating more realistic and structurally sound flat-garment representations.
Unlock real-time semantic SLAM and multimodal interaction with 3D Gaussian Splatting using X-GS, a unified and extensible open framework.
Steer clear of catastrophic forgetting in VLMs with EvoPrompt, a new method that evolves prompts by preserving learned semantic directions while adapting their magnitude.
State-of-the-art skeleton-based action recognition is now possible through a game-theoretic contrastive learning framework that maximizes action-relevant information while minimizing encoding redundancy.
Large models are emerging as a promising new paradigm for translating complex-layout document images, as shown by the ICDAR 2025 DIMT competition.
BinaryAttention proves you can more than halve the runtime of attention in vision and diffusion transformers without sacrificing accuracy, simply by using the sign of queries and keys.
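The sign-only scoring idea can be sketched in a few lines of NumPy; this is a minimal hypothetical version assuming standard softmax attention otherwise, not the paper's implementation:

```python
import numpy as np

def sign_attention(Q, K, V):
    # Scores use only the signs of queries and keys, so the score matmul
    # reduces to +/-1 arithmetic (cheap integer/bit ops in practice);
    # softmax and the value matmul stay in full precision.
    scores = (np.sign(Q) @ np.sign(K).T) / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = sign_attention(Q, K, V)  # shape (4, 8)
```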
By explicitly modeling how abnormalities relate within and across different medical image views, GIIM achieves significantly higher diagnostic accuracy and robustness, even with incomplete data.
Stream 3D Gaussian Splatting scenes with higher visual quality and lower bandwidth by predicting user viewpoints and dynamically adapting bitrate using deep reinforcement learning.
A 4B-parameter model outperforms Gemini-3-Pro in autonomous driving by incorporating physics-informed constraints and style-aware training, suggesting specialized models can surpass larger, general-purpose models in domain-specific tasks.
A complete, GPU-accelerated bimanual mobile manipulation platform can be built for under $1300, opening up robotics research and education to a wider audience.
Achieve near-FP32 image restoration performance with an Int8 model that runs at 442 FPS on NVIDIA Jetson Orin, all thanks to a quantization-aware distillation framework that avoids decoder distillation.
VLMs still struggle to understand our planet, as revealed by a new geospatial benchmark spanning diverse Earth observation tasks and multi-source sensing data.
Forget blurry sketch-to-image outputs: this method uses component-aware self-attention and coordinate-preserving fusion to generate photorealistic images with unprecedented fidelity and spatial accuracy.
Despite diverse formulations, ToF NLOS imaging methods hit similar performance walls in resolution and noise sensitivity when hardware is held constant, suggesting diminishing returns from algorithmic improvements alone.
By computing the *difference* between attention maps, DCAU-Net achieves state-of-the-art medical image segmentation while dramatically reducing computational cost compared to standard self-attention.
By incorporating language guidance into federated learning, SurgFed tackles the long-standing problem of tissue and task heterogeneity in surgical video understanding, leading to improved segmentation and depth estimation across diverse surgical settings.
Finally, a GelSight-style sensor that doesn't force you to choose between pre-contact vision and high-fidelity tactile sensing.
Ditch the flat scene graphs: TopoOR models surgical environments as higher-order topological structures, unlocking superior performance in safety-critical tasks by preserving complex relationships and multimodal data.
Precisely steer text-to-image generation along cognitive dimensions like valence and memorability with CogBlender, a framework that lets you dial in psychological intent.
Latency in VR conferencing hurts social presence, but this study quantifies the perceptual and cognitive mechanisms at play to guide system optimization.
Event cameras can now estimate depth with significantly improved temporal consistency and accuracy thanks to a novel distillation approach from video foundation models, achieving a 53% reduction in depth error.
Task demands in remote AR collaboration dictate how much network delay users can tolerate before perceived fluency breaks down, paving the way for adaptive systems.
Unlock the power of web videos for embodied AI: implicit geometry representations let agents learn to navigate from real-world room tours without relying on fragile 3D reconstruction.
By representing visual inputs as 3D Gaussian primitives, GST-VLA unlocks a new level of geometric understanding for vision-language-action models, leading to substantial performance gains in robotic manipulation tasks.
Reverse image search, a key tool for visual fact-checking, often amplifies misinformation and irrelevant content, burying debunking information.
ConvNets strike back: a ConvNeXt-based diffusion model matches Transformer performance at half the FLOPs and 7x faster training, all on just 4 GPUs.
Achieve real-time super-resolution ultrasound without labeled data using CycleULM, a CycleGAN-based framework that boosts image contrast by 15.3 dB and localization precision by 46%.
Chamfer distance, the workhorse loss for point cloud tasks, can actually *increase* when you optimize it, unless you use non-local coupling to avoid gradient collapse.
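For context, the standard nearest-neighbor-coupled Chamfer distance the blurb refers to looks like this in NumPy; each point receives gradients only from its single nearest neighbor in the other set, which is the local coupling a non-local variant would relax:

```python
import numpy as np

def chamfer(P, Q):
    # Pairwise squared distances between the two point sets.
    d = np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=-1)
    # Each point is coupled only to its nearest neighbor in the other set.
    return d.min(axis=1).mean() + d.min(axis=0).mean()

P = np.array([[0.0, 0.0], [1.0, 0.0]])
Q = np.array([[0.0, 0.0], [1.0, 1.0]])
cd = chamfer(P, Q)  # 0.5 + 0.5 = 1.0
```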
Imagine writing a script and instantly seeing it come to life – Doki makes generative video authoring as intuitive as writing a text document.
Combining pre-trained and custom neural networks with data augmentation and transfer learning yields a robust autonomous driving system capable of accurately perceiving and reacting to its environment.
Finally, a single model that can generate both your face and voice, convincingly controlled by text prompts and reference clips.
Provably secure steganography can now withstand real-world image compression and processing thanks to a clever latent-space optimization technique.
Bridge the gap between sparse core samples and continuous wellbore data with a cGAN that synthesizes realistic subsurface images conditioned on well log porosity.
Forget retraining: this guideline-aware AI agent instantly adapts to new radiotherapy protocols, outperforming supervised models in clinical preference.
Reconstructing and simulating wind-driven dynamics from video is now possible with a new differentiable framework that enforces fluid dynamics laws.
Panoramic vision-language models can achieve a level of holistic scene understanding and robustness in adverse conditions that's impossible for traditional pinhole-based VLMs.
A new video-based reward model beats GPT-5.2 and Gemini-3 Pro at evaluating computer-using agents, offering a scalable, model-agnostic alternative to traditional methods.
Adapt your action anticipation model on-the-fly to new viewpoints (egocentric or exocentric) with a novel test-time adaptation method that leverages multi-label prototype growing and dual-clue consistency.
Achieve 45x compression of 3D Gaussian Splatting data while *improving* visual fidelity by over 10% with a streaming-friendly octree-based codec.
Ditch global embeddings for text-motion retrieval: this method uses joint-angle motion images and token-patch late interaction to achieve state-of-the-art accuracy and interpretability.
By explicitly modeling per-splat appearance variance, VarSplat enables more robust 3D Gaussian Splatting SLAM, particularly in low-texture or reflective environments where existing methods struggle.
A plug-and-play module, RESBev, fortifies BEV perception against sensor degradation and adversarial attacks by learning latent BEV state transitions, offering a practical route to more reliable autonomous driving systems.
Worsening of a specific lung abnormality called PPFE, easily measurable on routine lung cancer screening CT scans, strongly predicts earlier death and respiratory problems.
RiO-DETR makes real-time oriented object detection with transformers a reality by cleverly decoupling angle estimation and injecting angular diversity into dense supervision.
Existing vision-language models fall flat when it comes to spotting time-dependent robot errors, but TIMID nails it with weak supervision and a clever VAD architecture.
By explicitly modeling and mitigating the confounding effects of visual context, CIGPose achieves state-of-the-art whole-body pose estimation, outperforming previous methods even without relying on extra training data.
FetalAgents leapfrogs existing fetal ultrasound analysis tools by dynamically orchestrating specialized AI agents, outperforming monolithic models across diverse clinical tasks and delivering structured clinical reports from video streams.
DRIFT achieves state-of-the-art object detection performance on 4D radar point clouds by fusing local and global contexts with a novel dual-representation transformer architecture.
By fusing confidence-weighted point cloud projections with a Kalman-inspired update mechanism, ConfCtrl enables diffusion models to generate geometrically consistent novel views from sparse inputs, even under significant viewpoint shifts.
Ditch brittle point-guided line matching: this VIO system uses optimal transport on learned line descriptors for globally consistent correspondences, boosting robustness in challenging visual conditions.
By learning visual representations from scene-level semantics down to pixel-level details, C2FMAE overcomes the limitations of both contrastive learning and masked image modeling.
A single spatial token, learned via occupancy prediction on a massive dataset, is surprisingly effective at injecting crucial spatial awareness into vision-language navigation, leading to state-of-the-art performance.
MLLMs struggle with visually rendered text not because they can't reason, but because they can't *read* it, and a simple self-distillation fix closes the gap.
Unlock high-fidelity 3D reconstruction for curved visuotactile sensors with just a few simple contacts, thanks to a new physics-consistent calibration framework.