Search papers, labs, and topics across Lattice.
100 papers published across 4 labs.
Achieve both low-bitrate perceptual video compression and practical scalability with ProGVC, a framework that unifies progressive transmission, efficient entropy coding, and detail synthesis.
Image-conditioned video diffusion models can now be fine-tuned to produce more realistic motion dynamics and long-term temporal coherence via a novel reward-driven approach that avoids common pitfalls like reward hacking.
By explicitly reasoning in 3D, VolumeDP leaps ahead of 2D-based imitation learning methods, achieving a remarkable 14.8% improvement on the LIBERO benchmark and robust real-world generalization.
LLMs can be prompted to generate part-aware instructions that substantially improve open-vocabulary 3D affordance grounding by linking semantically similar affordances and refining geometric differentiation.
The field of video understanding is rapidly shifting from isolated pipelines to unified models capable of adapting to diverse downstream tasks, demanding a re-evaluation of current approaches.
Unleash creativity in text-to-image models with a single, reusable 64-token template, sidestepping costly iterative prompt engineering and reasoning.
Even with a 98:1 test-to-train ratio, PEFT methods like QLoRA can unlock surprisingly strong generalization from billion-parameter vision models for agricultural image classification, suggesting that underfitting, not overfitting, is the bigger risk.
SAM3 disappoints in eye image segmentation, failing to surpass SAM2's performance despite its new concept prompting mode.
By treating 3D scene editing as goal-regressive planning rather than pure generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility that existing methods miss.
This model beats clinical reports in quantitative coronary angiography, opening the door to automated, comprehensive assessment of coronary artery disease at the point of care.
Achieve stable, real-time kilometer-scale autonomous driving simulations by generating vector-graph tiles incrementally using a novel diffusion flow approach.
Forget verbose instructions: this new VLN paradigm uses floor plans to guide navigation with concise commands, boosting success rates by 60%.
Existing 3D visual grounding methods crumble in complex scenes, but PC-CrossDiff's dual-level attention unlocks a +10% accuracy boost by parsing subtle spatial cues.
Video diffusion transformers exhibit a hidden "magnitude hierarchy" in their activations that can be exploited for training-free quality improvements via a simple steering method.
Forget geometric LODs: tokenizing 3D shapes by semantic salience unlocks SOTA reconstruction and efficient autoregressive generation with 10x-1000x fewer tokens.
Achieve state-of-the-art fine-grained visual recognition without training by adaptively invoking reasoning in a Large Vision-Language Model only when needed, significantly reducing computational overhead.
Denoised eye-tracking heatmaps dramatically boost the generalization of iris presentation attack detection, outperforming hand annotations and even self-supervised DINOv2 features.
Generate consistent stereo videos directly from RGB data, bypassing depth estimation and monocular-to-stereo conversion, with StereoWorld's novel camera-aware attention mechanisms.
Forget fixed layer counts: LaDe generates fully editable, layered media designs with a *flexible* number of semantically meaningful layers, outperforming existing methods in text-to-layer alignment.
Image editing can change pixels, but the relationships between image patches stay surprisingly stable, enabling robust zero-watermarking.
Class reweighting and anatomy-guided decoding can substantially improve the performance of video analysis pipelines for rare events in imbalanced gastrointestinal datasets.
Legged robots can now perform robust parkour with a 1-meter visual blind zone, thanks to a novel architecture that tightly couples vision, proprioception, and physics-based state estimation.
Forget training separate models for each compression level; this framework achieves state-of-the-art extreme image compression with flexible bitrate control using a single diffusion-based arbitrary-scale super-resolution model.
MLLMs' image segmentation prowess isn't a given: a critical adapter layer actually *hurts* performance, forcing the LLM to recover via attention-mediated refinement.
Synthetic data and virtual environments are rapidly becoming indispensable for autonomous driving, but realizing their full potential requires tackling challenges like Sim2Real transfer and scalable safety validation.
Injecting "beneficial noise" into cross-attention mechanisms can significantly improve unsupervised domain adaptation by forcing models to focus on content rather than style distractions.
Forget real-world video datasets: training VLMs on just 7.7K synthetic videos with temporal primitives beats 165K real-world examples, unlocking surprisingly effective transfer learning for video reasoning.
Injecting semantic information from related modalities early in the embedding process significantly boosts performance on multimodal medical image classification tasks.
Achieve state-of-the-art semantic 3D reconstruction from sparse views by intelligently pruning redundant Gaussians and blending 2D and 3D semantic cues.
Simply translating symbolic sign language notations into natural language unlocks significantly better motion generation when conditioning on phonological attributes with CLIP.
Finally, a unified framework lets you control both facial appearance and voice timbre for personalized audio-video generation across multiple identities.
Unlock automated creation of production-ready 3D assets from untextured meshes with TAPESTRY, which generates geometrically consistent turntable videos that can be back-projected into UV textures or used to supervise neural rendering.
Counterintuitively, the most *unreliable* samples in medical imaging datasets—those with fluctuating confidence and frequent forgetting during training—are the *most* informative for building accurate decision boundaries.
Interactive avatars can now exhibit more emotionally appropriate and contextually aware facial behaviors thanks to a novel architecture that disentangles audio-driven lip movements from user-driven non-lip facial expressions.
By disentangling semantic and contextual cues in vision-language models, PCA-Seg achieves state-of-the-art open-vocabulary segmentation with only 0.35M additional parameters per block.
Training video diffusion models with pixel-wise losses just got a whole lot cheaper: ChopGrad reduces memory complexity from linear to constant in video length.
Radiologist dictation, combined with foundation models and minimal parameter updates, can achieve state-of-the-art MRI brain tumor segmentation.
Forget prompt engineering: this new region proposal network spots objects across diverse datasets without *any* text or image prompts.
Flash photography reveals subtle material differences in fingerprints, enabling more robust spoof detection compared to traditional single-image methods.
Achieve high-fidelity transparent text animations from image-to-video models without retraining the VAE, sidestepping data scarcity and latent pattern mixing issues.
Animate 3D characters using bananas and plush toys – DancingBox turns everyday objects into motion capture proxies, making animation accessible to novices.
Forget fine-tuning: this method uses smart patch selection to adapt frozen LVLMs for deepfake detection, outperforming baselines without any training.
Facial micro-movements betray your cognitive load, revealing a new pathway to real-time workload monitoring using just a webcam.
Reconstructing realistic 3D human crowds from a single image is now possible, thanks to a new method that cleverly handles occlusions and appearance variations.
Ditch the overconfident posteriors: Structured SIR offers a memory-efficient way to capture complex, multi-modal uncertainty in high-dimensional image registration, outperforming variational inference.
Forget expensive 3D training data: Loc3R-VLM shows how to give 2D vision-language models strong 3D spatial reasoning by distilling knowledge from a pretrained 3D foundation model using only monocular video.
By cleverly using readily available video segmentation masks, this method boosts DINOv2's point tracking performance by over 14% – a surprisingly effective way to inject temporal awareness into static image-pretrained models.
Drones can now land safely in complex, unknown environments using only a camera, thanks to a new system that dynamically maps and reacts to surroundings in real-time.
Sound source localization gets a reliability upgrade: conformal prediction delivers uncertainty estimates, even when you don't know how many speakers are talking.
Overcome scarce data and boost material classification accuracy by generating synthetic training data and distilling knowledge from vision-language foundation models.
Current PII detection models are blind to the transaction-level identifiers and partially-filled forms that computer-use agents readily expose, but a new benchmark closes the gap.
Even when visual data is missing or noisy, EgoAdapt accurately determines who is talking to the camera wearer by adaptively integrating head orientation, lip movement, and robust audio features.
Unlock accurate monocular 3D object tracking with minimal annotation: Sparse3DTrack achieves state-of-the-art performance using only a handful of labels per track.
Robot world models can be significantly improved by directly rewarding them for generating videos that lead to physically plausible robot actions, even if the videos themselves contain visual artifacts.
A complete autonomy stack enables centimeter-level localization and mapping on the moon, even without GPS.
Image editing models leak fascinating hints about their world knowledge through "edit spillover"—unintended changes to semantically related regions—and this paper turns that leakage into a probe.
SpiderCam shatters power consumption barriers for FPGA-based 3D cameras, achieving sub-Watt operation while maintaining real-time performance.
A new prompting strategy closes the gap between general-purpose and specialized cell segmentation models, suggesting a path to more efficient adaptation.
Steganography gets smarter: this framework hides data more effectively by adapting the amount of information concealed in each pixel based on image complexity and payload size.
Unlock scalable aerial scene understanding with SegFly, a massive RGB-T dataset generated via a novel 2D-3D-2D label propagation technique that requires minimal manual annotation.
Ditch the diffusion vs. autoregressive debate: this VLA framework uses diffusion to *draft* actions and an autoregressive model to *verify* them, boosting real-world success by nearly 20%.
Quantizing large vision-language models just got a whole lot better: a new token-level sensitivity metric closes the accuracy gap with full-precision models by up to 1.6% in 3-bit weight-only quantization.
Achieve 4K image-to-video generation with diffusion models without training by cleverly fusing tiled denoising with a low-resolution latent prior, balancing detail and global coherence.
CLIP struggles with fine-grained details in cross-domain few-shot learning, but a cycle-consistency method can fix its vision-language alignment and boost performance.
Synthesizing realistic intermediate video frames just got a whole lot better, thanks to a novel attention mechanism that anchors to keyframes and text prompts for improved consistency and semantic alignment.
Achieve SE(3) equivariance and memory scalability in point cloud analysis with coordinate-based kernels, outperforming state-of-the-art equivariant methods on diverse tasks.
Achieve state-of-the-art anomaly detection in multi-class and continual learning scenarios with AdapTS, a teacher-student framework that slashes memory overhead by up to 149x compared to existing methods.
By probabilistically fusing visual context into text prompts, VirPro closes the semantic gap in weakly-supervised 3D detection, boosting performance by nearly 5% on KITTI.
Mamba, the darling of sequence modeling, now powers a GAN that beats StyleGAN2-ADA in image synthesis, thanks to a clever latent space routing trick.
By cleverly turning novel view synthesis into a self-supervised inpainting problem, VisionNVS eliminates the need for ground truth images of novel views, outperforming LiDAR-dependent baselines.
YOLO can learn faster and better by strategically skipping redundant images during training, achieving a 1.43x speedup and improved accuracy with a new Anti-Forgetting Sampling Strategy.
Even with environmental noise, a VAE-based anomaly detector can spot adversarial attacks on collaborative DNNs with high accuracy.
Unlock the power of MLLMs for structured data like human skeletons with a differentiable rendering approach that allows end-to-end training.
By fusing IMU-derived egomotion with visual data, Motion-MLLM lets MLLMs achieve SOTA 3D scene understanding with 40% less compute.
By unifying layout-to-image generation and image grounding with a novel cycle-consistent learning approach, EchoGen achieves state-of-the-art results in both tasks, proving that solving two problems at once can be better than solving them separately.
Forget finetuning: DynaEdit unlocks complex video edits like action modification and object insertion, all without training, using clever manipulation of pretrained text-to-video models.
Forget waiting minutes for iterative optimization – Omni-3DEdit performs diverse 3D editing tasks in a single forward pass.
Dashcam videos can now be directly linked to legal responsibility determinations via a novel multimodal dataset and legal reasoning framework, outperforming existing LLMs and agent-based systems.
By adaptively calibrating facts and augmenting emotions, FACE-net overcomes the factual-emotional bias that plagues emotional video captioning.
An AI model can estimate legal age from clavicle CT scans with higher accuracy than human experts, potentially revolutionizing forensic age assessment.
A new prompt-free medical image segmentation model achieves impressive zero-shot and cross-modal transfer performance by explicitly disentangling geometric and semantic anatomical knowledge.
By reorganizing 3D scenes into structurally-aware subscenes, S-VGGT offers a parallel geometric bridge for efficient processing, slashing global attention costs without compromising reconstruction fidelity.
Skip the costly training and go straight to open-vocabulary 3D reasoning with ReLaGS, which builds a 3D semantic scene graph from language-distilled Gaussians.
A new RGB-T dataset and frequency-aware network expose the surprising limitations of existing UAV detectors when faced with real-world camouflage and complex backgrounds.
Anonymized faces don't have to be expressionless blobs: this method preserves realistic expressions and lighting while scrambling identity.
Overcome weather limitations in remote sensing with MM-OVSeg, a multimodal Optical-SAR fusion framework that enables robust open-vocabulary segmentation even under cloudy conditions.
AI spots a hidden pattern in lung scans of lupus patients, revealing that specific airway dilations in the upper lobes could be a telltale sign of interstitial lung disease.
Grabbing two keyframes per shot – one for the gist, one for the glitch – lets you compress videos for VLMs without missing critical anomalies.
RLHF for autoregressive video generation gets a boost with AR-CoPO, which overcomes the limitations of SDE-based methods by using chunk-level alignment and a semi-on-policy training strategy.
RIS models struggle with motion-based queries, but a new data augmentation and contrastive learning approach closes the gap without sacrificing performance on appearance-based descriptions.
Achieve competitive video generation with Stable Diffusion using only 2.9% additional parameters by adapting temporal attention based on motion content, outperforming methods with explicit temporal consistency losses.
NeRFs can now guide extraterrestrial rovers around unexpected obstacles, thanks to a novel planning framework that blends local observations with global terrain understanding.
Surprisingly, you can achieve smooth, controllable image editing in text-to-image models without any training, just by intelligently nudging the text embeddings.
Panoramic 3D reconstruction gets a boost with PanoVGGT, a Transformer that handles spherical distortions and global-frame ambiguity to deliver state-of-the-art accuracy in a single pass.
Gesture-aware pretraining unlocks significant improvements in 3D hand pose estimation, proving that semantic gesture information acts as a powerful inductive bias.
Differential attention and asymmetric loss functions can significantly improve the performance of BiomedCLIP on highly imbalanced video classification tasks like identifying rare pathologies in video capsule endoscopy.
Reconstructing complete, animatable 3D avatars from heavily occluded YouTube videos is now possible, thanks to a hallucination-as-supervision pipeline using diffusion models.
Medical vision-language models perform better when the modality gap is tuned to an intermediate level, challenging the assumption that minimizing it is always optimal.
By focusing on semantic differences between scans, DiffVP lets LLMs generate more accurate CT reports without needing explicit lesion localization.
An 8B parameter model, RideJudge, outperforms 32B baselines in ride-hailing dispute adjudication by aligning visual semantics with evidentiary protocols, achieving 88.41% accuracy.