Computer Vision - Weekly Roundup

Learning from a single labeled face and a stream of unlabeled data

3w ago·also INRIA, Paris-Saclay

Unlock face recognition with just one labeled example and a flood of unlabeled data, achieving state-of-the-art accuracy in a practical authentication scenario.

Branislav Kveton, Michal Valko

Computer Vision Data Curation & Synthetic Data Training Efficiency & Optimization

Hanzhong Guo +10·3w ago·also ByteDance

Leveraging Verifier-Based Reinforcement Learning in Image Editing

Image editing gets a reasoning upgrade: a chain-of-thought verifier model beats powerful VLMs at judging edits and boosts editing model performance.

Hanzhong Guo, Jie Wu +8

Computer Vision Multimodal Models RLHF & Preference Learning

May 1, 2026

Sai Niranjan Ramachandran +1·3w ago

Trees to Flows and Back: Unifying Decision Trees and Diffusion Models

Decision trees and diffusion models are secretly doing the same thing: optimizing a shared objective called Global Trajectory Score Matching.

Sai Niranjan Ramachandran, S. Sra

Architecture Design (Transformers, SSMs, MoE) Computer Vision Training Efficiency & Optimization

Minghui Chen +7·3w ago

Online Self-Calibration Against Hallucination in Vision-Language Models

LVLMs are better at spotting their own mistakes than generating correct answers in the first place, and this self-awareness can be exploited to reduce hallucinations.

Minghui Chen, Chenxu Yang, He Zhu +5

Computer Vision Multimodal Models RLHF & Preference Learning

All Papers (100)

May 1, 2026

Wenda Chu +6·3w ago

End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

Jointly training the tokenizer and autoregressive model slashes ImageNet FID to 1.48, finally making end-to-end autoregressive image generation competitive.

Wenda Chu, Bingliang Zhang, Jiaqi Han +4

Architecture Design (Transformers, SSMs, MoE) Computer Vision Training Efficiency & Optimization

Stanford HAI·3w ago·also Tsinghua AI, Beihang, CUHK, HKUST +1

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

Instead of training separate video diffusion models for each multimodal task, UniVidX learns a single model that handles diverse pixel-aligned video generation problems.

Houyuan Chen, Hong Li, Xianghao Kong +8

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Microsoft Research·3w ago·also SNU

Map2World: Segment Map Conditioned Text to 3D World Generation

Forget grid layouts: Map2World lets you generate consistent 3D worlds from arbitrary segment maps, offering unprecedented control and scalability.

Jaeyoung Chung, Suyoung Lee, Jianfeng Xiang +2

Computer Vision Multimodal Models World Models & Planning

Yan Fang +9·3w ago

Let ViT Speak: Generative Language-Image Pre-training

Ditch the complex multimodal pre-training pipelines: GenLIP proves a simple language modeling objective can effectively align vision encoders with LLMs, achieving strong performance with less data.

Yan Fang, Mengcheng Lan, Zilong Huang +7

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Siyuan Huang +8·3w ago

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

LVLMs can maintain sharper visual focus during long-form generation by adding a lightweight, learnable memory module that bypasses attention dilution.

Siyuan Huang, Xiaoye Qu, Yafu Li +6

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Apr 30, 2026

Emma Andrews +5·3w ago

Defending Quantum Classifiers against Adversarial Perturbations through Quantum Autoencoders

Quantum autoencoders can purify adversarial examples, boosting the robustness of quantum classifiers by up to 68% without adversarial training.

Emma Andrews, Sahan Sanjaya +3

Computer Vision Red-Teaming & Adversarial Robustness

Max-Planck-Institut für Informatik·3w ago·also Cambridge

Faster 3D Gaussian Splatting Convergence via Structure-Aware Densification

Stop blurring the details: structure-aware Gaussian Splatting densification uses frequency analysis to resolve high-frequency textures faster and with higher quality.

Linjie Lyu, Ayush Tewari +4

Online semi-supervised perception: Real-time learning without explicit feedback

3w ago·also INRIA, Intel Labs, Paris-Saclay, Pitt

Forget reinforcement learning; this algorithm learns in real-time without any feedback at all.

Branislav Kveton, Matthai Philipose +311

Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift

Haiyang Zhao·3w ago

Simply detecting distribution shifts in visual MBRL is easy; the real challenge is applying the right action-level corrections, which this paper tackles with a novel local expert growth strategy.

Haiyang Zhao

Computer Vision Robotics & Embodied AI World Models & Planning

Clemson University·3w ago

Understanding Adversarial Transferability in Vision-Language Models for Autonomous Driving: A Cross-Architecture Analysis

Architectural diversity offers surprisingly little defense against adversarial attacks on VLMs for autonomous driving, with physical patches transferring effectively across different models.

David Fernandez, Pedram MohajerAnsari, Amir Salarpour +2

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

Jia-lian Liu +3·3w ago

AG-TAL: Anatomically-Guided Topology-Aware Loss for Multiclass Segmentation of the Circle of Willis Using Large-Scale Multi-Center Datasets

Segmenting tiny brain arteries just got a whole lot better: a new loss function boosts Dice scores by up to 10% on these critical but challenging structures.

Jia-lian Liu, Jialu Liu, Yue Cui +1

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

3w ago·also HKUST, SJTU, The Hong Kong University

Forget tedious, brittle automation scripts: RL-powered GUI agents are showing signs of "System 2" reasoning without explicit supervision, hinting at a future of truly intelligent digital inhabitants.

Junan Hu, Jian Liu, Jin-Shei Lai +7

Computer Vision RLHF & Preference Learning Tool Use & Agents

3w ago

Learning to Reason: Targeted Knowledge Discovery and Fuzzy Logic Update for Robust Image Recognition

Unsupervised knowledge injection via fuzzy logic lets image classifiers reason about concepts they were never explicitly trained on, boosting accuracy and generalization.

Gurucharan Srinivas, Joshua Niemeijer +3

Computer Vision Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought

Tsinghua AI·3w ago

Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation

Even the most advanced vision-language models struggle to accurately identify anatomical structures in medical images, raising serious concerns about their reliability in clinical settings.

Xupeng Chen, Binbin Shi, Chenqian Le +5

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Nhi Ngoc-Yen Nguyen +5·3w ago

Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention

Ignoring language-specific structure in scene-text captioning is a recipe for disaster in tonal languages like Vietnamese, but a new graph framework leveraging phonological attention can help.

Nhi Ngoc-Yen Nguyen, Anh-Duc Nguyen, Anh Nguyen +3

Computer Vision Multimodal Models Natural Language Processing

3w ago·also Princeton

AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

Even GPT-5.1 struggles to distinguish AI-generated academic images from real ones, achieving only 48.8% accuracy, revealing a significant gap between generative and forensic AI capabilities.

Bo Zhang, Tzu-Yen Ma +33

Computer Vision Data Curation & Synthetic Data Eval Frameworks & Benchmarks

Hina Saeeda +1·3w ago

Requirements Debt in AI-Enabled Perception Systems Development: An Industrial RE4AI Perspective

The hidden cost of rapidly iterating on AI-enabled perception systems? A growing "Requirements Debt" that threatens auditability, reliability, and certification readiness.

Hina Saeeda, Soniya Abraham

OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction

Junyoung Lee +13·3w ago

A 48-camera system finally unlocks real-time, room-scale multi-human, multi-robot interaction research in realistic home environments.

Junyoung Lee, Sookwan Han +11

Computer Vision Multimodal Models Robotics & Embodied AI

Jing Zhang +10·3w ago

Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation

By unifying specialized detectors with MLLMs in an agentic framework, Echo-α achieves state-of-the-art ultrasound interpretation, suggesting a path to more accurate, interpretable, and transferable medical AI.

Jing Zhang, Wentao Jiang, Tao Huang +8

Computer Vision Multimodal Models Tool Use & Agents

Zujin Guo +6·3w ago

Generate Your Talking Avatar from Video Reference

Ditch the static image: this method generates realistic talking avatars by learning from *videos* of the subject in completely different scenes.

Zujin Guo, Zhenhui Ye, Yi Ren +4

Computer Vision Multimodal Models Speech & Audio

Nuria Alabau-Bosque +8·3w ago·also Universitat de València

Parameter-Efficient Architectural Modifications for Translation-Invariant CNNs

CNNs are surprisingly fragile to even single-pixel shifts, but strategically placed global average pooling can fix this with a 98% parameter reduction and no accuracy loss.

Nuria Alabau-Bosque, Jorge Vila-Tomás +6

Architecture Design (Transformers, SSMs, MoE) Computer Vision Training Efficiency & Optimization

Gour Mahavidyalaya·3w ago·also University of Gour Banga

GourNet: A CNN-Based Model for Mango Leaf Disease Detection

A lightweight CNN can achieve 97% accuracy in classifying mango leaf diseases, offering a practical solution for early disease detection in agriculture.

Ekram Alam, Jaydip Sanyal, Akhil Kumar Das +2

Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning

Tsinghua AI·3w ago·also BUPT

Today's best vision-language models are surprisingly bad at reading scientific figures, failing to match expert-level reasoning on a new benchmark of experimental images.

Junpeng Ding, Zichen Tang +21

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Olivier Parisot·3w ago

An Extended Evaluation Split for DeepSpaceYoloDataset

A new test split for DeepSpaceYoloDataset helps push the boundaries of automated astronomical object detection by providing a more diverse and challenging evaluation benchmark.

Olivier Parisot

Computer Vision Data Curation & Synthetic Data Scientific Discovery & Drug Design

3w ago·also CUHK

RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation

By explicitly aligning image features with the hierarchical structure of radiology reports, RIHA generates more clinically accurate and coherent reports than models that treat reports as flat sequences.

Yucheng Chen, Yang Yu, Yufei Shi +3

Computer Vision Multimodal Models Natural Language Processing

Kaixiang Shu·3w ago

Adjoint Inversion Reveals Holographic Superposition and Destructive Interference in CNN Classifiers

CNN classifiers don't just select from cleaned features; they actively cancel out shared background information via destructive interference, rewriting our understanding of how these networks actually "see".

Kaixiang Shu

Architecture Design (Transformers, SSMs, MoE) Computer Vision Interpretability & Mechanistic Interp

3w ago·also PKU

Uni-HOI: A Unified Framework for Learning the Joint Distribution of Text and Human-Object Interaction

Forget task-specific architectures: Uni-HOI uses a unified framework with LLMs to jointly model text, human motion, and object motion, enabling strong performance across diverse HOI tasks.

Mengfei Zhang, Jinlu Zhang, Zhigang Tu

Computer Vision Multimodal Models Natural Language Processing

Menglin Deng +19·3w ago·also Fudan, RUYi Dynamics Co

EdgeFM: Efficient Edge Inference for Vision-Language Models

EdgeFM delivers production-grade VLM/LLM inference performance on edge devices, outperforming vendor-specific toolchains by up to 49% while remaining open-source and cross-platform.

Menglin Deng, Mengling Deng, Yuanpeng Chen +17

Computer Vision Inference & Quantization Multimodal Models

3w ago·also Oregon State

Softmax-GS: Generalized Gaussians Learning When to Blend or Bound

Stop those blurry edges: Softmax-GS uses learnable competition between Gaussians to sharpen 3D Gaussian Splatting, achieving state-of-the-art performance in novel view synthesis.

Chen Ziwen, Peng Wang, Hao Tan +2

Sparse-View 3D Gaussian Splatting in the Wild

Ajou University·3w ago·also GenGenAI, SNU, UT Austin

Achieve high-fidelity 3D rendering from sparse, unconstrained real-world images by intelligently synthesizing novel views with diffusion models and Gaussian replication.

Wongi Park, Jordan A. James, Myeongseok Nam +4

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Yang Zhou +7·3w ago

A Real-time Scale-robust Network for Glottis Segmentation in Nasal Transnasal Intubation

Real-time glottis segmentation during Nasotracheal Intubation just got a whole lot faster and more accurate, thanks to a new network that's both lightweight and scale-robust.

Yang Zhou, Chao Zhang +5

Hyperspectral Image Classification via Efficient Global Spectral Supertoken Clustering

Peifu Liu +5·3w ago

Achieve faster, more accurate hyperspectral image classification by decoupling pixel clustering from classification, yielding region-level consistency and boundary alignment.

Peifu Liu, Tingfa Xu, Jie Wang +3

Computer Vision

Tsinghua AI·3w ago·also Microsoft Research

CasLayout: Cascaded 3D Layout Diffusion for Indoor Scene Synthesis with Implicit Relation Modeling

Forget fully connected relation graphs: CasLayout's sparse relation modeling unlocks enhanced controllability and realism in 3D indoor scene synthesis.

Yingrui Wu, Youkang Kong, Mingyang Zhao +5

Architecture Design (Transformers, SSMs, MoE) Computer Vision Data Curation & Synthetic Data

Yabo Luo +4·3w ago·also Osh State University

Gait Recognition via Deep Residual Networks and Multi-Branch Feature Fusion

Achieve state-of-the-art gait recognition by dynamically fusing body shape and motion features, even when people are wearing coats.

Yabo Luo, Xiaoyu Wang, Xiaoyun Wang +2

SQuadGen: Generating Simple Quad Layouts via Chart Distance Fields

Tsinghua AI·3w ago·also Microsoft Research

Simple, artist-friendly quad meshes can now be automatically generated on 3D shapes using a diffusion model trained on a continuous surface representation, sidestepping the complexity of discrete mesh optimization.

Youkang Kong, Yang Liu +4

Architecture Design (Transformers, SSMs, MoE) Computer Vision

3w ago·also Mississippi State University, PolyU

Representative Spectral Correlation Network for Multisource Remote Sensing Image Classification

Ditching PCA for spectral reduction can yield state-of-the-art performance in multisource remote sensing image classification while slashing computational costs.

Chuanzheng Gong, Feng Gao, Junyan Lin +2

YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal

Nankai University·3w ago·also Huawei

Achieve up to 2.5X faster video object removal by focusing DiT computations only on the essential tokens dictated by the mask.

Chenyang Wu, Lina Lei, Fan Li +6

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

Yizhou Wu +8·3w ago·also Emory

BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning

A single self-supervised model trained on millions of unlabeled brain MRI slices can generalize across diverse neuroimaging tasks, rivaling or exceeding specialized models, even with limited labeled data.

Yizhou Wu, Shansong Wang, Yuheng Li +6

Computer Vision Data Curation & Synthetic Data Scientific Discovery & Drug Design

3w ago·also Dolby Labs

ReVo: A Cross-Layer Reliable Volumetric Videoconferencing System

Volumetric videoconferencing doesn't have to freeze and stutter: ReVo recovers up to 32% of lost RGB data and slashes video freezes by 95% using a cross-layer approach.

Ankur Aditya, Diptyaroop Maji, Lingdong Wang +6

Computer Vision Distributed Systems & Hardware

Francisco M. López +12·3w ago

Simulating Infant First-Person Sensorimotor Experience via Motion Retargeting from Babies to Humanoids

Unlock a baby's-eye view: Reconstruct and replay infant movements on robots to simulate their sensory experiences, offering unprecedented insights into early development.

Francisco M. López, Hoshinori Kanazawa +10

Computer Vision Multimodal Models Robotics & Embodied AI

Davide Di Nucci +4·3w ago·also University of Modena and Reggio Emilia

Fake3DGS: A Benchmark for 3D Manipulation Detection in Neural Rendering

Current image forensics fall flat when faced with the subtle manipulations now possible in 3D Gaussian Splatting scenes, highlighting a critical gap in content authenticity assessment.

Davide Di Nucci, Riccardo Catalini, G. Borghi +2

Computer Vision Data Curation & Synthetic Data Eval Frameworks & Benchmarks

Pengna Li +9·3w ago

SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

Teaching VLMs to "look back" and "look ahead" with lightweight spatial reasoning tasks unlocks surprisingly strong navigation performance.

Pengna Li, Kangyi Wu, Shaoqing Xu +7

Computer Vision Multimodal Models Robotics & Embodied AI

3w ago·also HIT

Frequency-Aware Semantic Fusion with Gated Injection for AI-generated Image Detection

Simple frequency masking and gated injection can dramatically improve the generalization of AI-generated image detectors, even against unseen generative models.

Shuchang Zhou, Shangkun Wu, Shang Wu +3

Noise2Map: End-to-End Diffusion Model for Semantic Segmentation and Change Detection

Ali Shibli +3·3w ago·also KTH

Ditch the costly sampling: Noise2Map turns diffusion models into fast, end-to-end semantic segmentation and change detection machines by directly predicting maps from noise.

Ali Shibli, Andrea Nascetti +1

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

3w ago·also Hainan University

HiMix: Hierarchical Artifact-aware Mixup for Generalized Synthetic Image Detection

Existing synthetic image detectors fail to generalize because they're trained on biased data, but HiMix overcomes this with artifact-aware representations and mixup augmentation, achieving state-of-the-art generalization to unseen generators.

Shuchang Zhou, Kaiwen Shen, Jiwei Wei +2

Computer Vision Data Curation & Synthetic Data

Sharayu Nilesh Deshmukh +5·3w ago

Are DeepFakes Realistic Enough? Exploring Semantic Mismatch as a Novel Challenge

Current DeepFake detectors can be fooled by semantically inconsistent real audio and video, highlighting a critical blind spot in their ability to assess realistic manipulations.

Sharayu Nilesh Deshmukh, Kailash A. Hambarde, Joana C. Costa +3

Computer Vision Red-Teaming & Adversarial Robustness Speech & Audio

Xiumei Li +4·3w ago

TAFA-GSGC: Group-wise Scalable Point Cloud Geometry Compression with Progressive Residual Refinement

Unlock bandwidth-adaptive point cloud transmission with TAFA-GSGC, a single-model codec that delivers up to 9 quality levels from a single bitstream.

Xiumei Li, Alexander Kopte +2

Computer Vision Inference & Quantization

Yujin Han +14·3w ago

AesRM: Improving Video Aesthetics with Expert-Level Feedback

Expert-level video aesthetics can be captured and improved using a hierarchical rubric and reward models trained with a progressive learning scheme.

Yujin Han, Yujie Wei, Yefei He +12

UHR-Net: An Uncertainty-Aware Hypergraph Refinement Network for Medical Image Segmentation

Shuokun Cheng +3·3w ago

By explicitly modeling uncertainty in hypergraph refinement, UHR-Net achieves more accurate segmentation of challenging lesions in medical images.

Shuokun Cheng, Jinghao Shi +1

Architecture Design (Transformers, SSMs, MoE) Computer Vision Scientific Discovery & Drug Design

Andrea Dunn Beltran +15·3w ago

Stop Holding Your Breath: CT-Informed Gaussian Splatting for Dynamic Bronchoscopy

Achieve clinically relevant accuracy in dynamic bronchoscopy without breath-hold protocols by modeling patient-specific respiratory deformation within a Gaussian splatting framework.

Andrea Dunn Beltran, Daniel Rho +13

Computer Vision Robotics & Embodied AI Scientific Discovery & Drug Design

3w ago·also UIUC

Generalizable Sparse-View 3D Reconstruction from Unconstrained Images

Forget per-scene optimization: GenWildSplat achieves state-of-the-art 3D reconstruction from sparse, unposed images in real-time using a purely feed-forward approach.

Vinayak Gupta, Chih-Hao Lin +6

Computer Vision Robotics & Embodied AI Training Efficiency & Optimization +1

3w ago

Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements

Discovering reusable, semantic "Action Motifs" from human movement data unlocks significant gains in action recognition, motion prediction, and interpolation.

Genki Kinoshita, Shu Nakamura +9

Architecture Design (Transformers, SSMs, MoE) Computer Vision Robotics & Embodied AI

3w ago·also HKUST

FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction

Achieve state-of-the-art open-vocabulary occupancy prediction without any training data, outperforming supervised and self-supervised methods by a large margin.

Zeyu Jiang, Changqing Zhou +3

Computer Vision Robotics & Embodied AI World Models & Planning

CMU ML·3w ago·also NEC Labs America

PhyCo: Learning Controllable Physical Priors for Generative Motion

Control over physical properties like friction and restitution in generated videos is now possible, paving the way for more realistic and controllable video synthesis.

Sriram Narayanan, Ziyu Jiang +3

Computer Vision Data Curation & Synthetic Data Robotics & Embodied AI +1

Jiawei Yang +5·3w ago

Representation Fréchet Loss for Visual Generation

Fréchet Distance, previously deemed impractical for training, unlocks surprisingly high-fidelity image generation when optimized in representation space with decoupled batch sizes.

Jiawei Yang, Zhengyang Geng, Xuan Ju +3

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

DAMO·3w ago·also NTU

Today's visual generation models are often evaluated on the wrong things, leading to inflated performance claims that mask critical failures in spatial reasoning, temporal consistency, and causal understanding.

Keming Wu, Zuhao Yang, Kaichen Zhang +28

Computer Vision Multimodal Models World Models & Planning

3w ago·also Shanghai AI Lab

World2Minecraft: Occupancy-Driven Simulated Scenes Construction

Reconstructing real-world scenes in Minecraft unlocks a customizable embodied AI playground, but only if we can solve the occupancy prediction bottleneck – and this new dataset shows we're not there yet.

Lechao Zhang, Haoran Xu, Jingyu Gong +3

Computer Vision Robotics & Embodied AI World Models & Planning

3w ago

ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control

Forget painstakingly programming robot interactions – ExoActor uses video generation to hallucinate plausible behaviors, then translates them into robot actions.

Yang Zhou, Yanghao Zhou, Jingyu Ma +7

Computer Vision Robotics & Embodied AI World Models & Planning

Kehong Gong +13·3w ago

MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons

Ditch the clunky inverse kinematics: MoCapAnything V2 learns to predict character rotations directly from video, slashing error rates and boosting speed by 20x.

Kehong Gong, Zhengyu Wen, Dao Thien Phong +11

When Do Diffusion Models Learn to Generate Multiple Objects?

Yujin Jeong +4·3w ago

Diffusion models struggle with multi-object generation not because of imbalanced concept representation, but primarily due to scene complexity and a surprising difficulty in counting, especially when training data is limited.

Yujin Jeong, Arnas Uselis, Iro Laina +2

Computer Vision Data Curation & Synthetic Data Multimodal Models

Xin Zhou +6·3w ago

HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation

HERMES++ achieves state-of-the-art performance in both future point cloud prediction and 3D scene understanding by unifying these tasks within a single driving world model.

Xin Zhou, Dingkang Liang, Xiwu Chen +4

Computer Vision Robotics & Embodied AI World Models & Planning

Yan Cui +8·3w ago·also Enable Medicine

Linking spatial biology and clinical histology via Haiku

By jointly embedding spatial biology, histology, and clinical data, Haiku lets you ask "what if" questions about disease progression, revealing molecular shifts linked to clinical outcomes.

Yan Cui, Jacob S. Leiby, Wenhui Lei +6

Computer Vision Multimodal Models Scientific Discovery & Drug Design

Yongpeng Cao +3·3w ago

SASI: Leveraging Sub-Action Semantics for Robust Early Action Recognition in Human-Robot Interaction

Robots can now better anticipate your actions thanks to a new method that understands the "sub-actions" within your movements.

Yongpeng Cao, Masahiro Hirano, Hyuno Kim +1

Nimrod Millenium Ndulue +6·3w ago·also Faculty of Science, Interdisciplinary Centre for Security, Luxembourg, Robotics Research Group

Learning-Based Hierarchical Scene Graph Matching for Robot Localization Leveraging Prior Maps

Hierarchical scene graph matching, learned end-to-end, unlocks fast and accurate robot localization by grounding real-time sensor data against prior architectural maps.

Nimrod Millenium Ndulue, Jose Andres Millan-Romera +4

Connected Dependability Cage: Run-Time Function and Anomaly Monitoring for the Development and Operation of Safe Automated Vehicles

Iqra Aslam +7·3w ago·also Clausthal University of Technology

Automated vehicles can achieve fail-operational capabilities by using a hierarchical monitoring framework that combines functional consistency checks with anomaly detection to handle system failures and unfamiliar scenarios.

Iqra Aslam, Nour Habib, Nouran Habib +5

Computer Vision Red-Teaming & Adversarial Robustness Robotics & Embodied AI

Andrew Bond +7·3w ago·also Hacettepe University, Koç University

Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces

Hyperspherical latent spaces unlock better 3D scene understanding from vision transformers, especially when bandwidth is constrained.

Andrew Bond, Ilkin Umut Melanlioglu +5

Architecture Design (Transformers, SSMs, MoE) Computer Vision World Models & Planning

3w ago

Diffusion-OAMP for Joint Image Compression and Wireless Transmission

Ditch the training data: this method uses a pre-trained diffusion model to jointly compress and transmit images, outperforming classic techniques without any task-specific training.

Wentao Hou, Yiming Bai +4

Computer Vision Inference & Quantization

Corinna Cortes +4·3w ago

Optimized Deferral for Imbalanced Settings

Expert imbalance can cripple learning-to-defer systems, but a novel cost-sensitive margin-based loss function can restore performance.

Corinna Cortes, Anqi Mao, M. Mohri +2

Computer Vision Natural Language Processing Training Efficiency & Optimization

Utrecht University·3w ago

From LLM-Driven Trading Card Generation to Procedural Relatedness: A Pokémon Case Study

Imagine a Pokémon TCG where every card is uniquely yours, dynamically generated by AI to reflect your playstyle and preferences.

Johannes Pfau, Panagiotis Vrettis

Computer Vision Multimodal Models Natural Language Processing

Stanford HAI·3w ago

ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space

Forget noisy starts – ABC diffusion models leverage the inherent structure of continuous processes, generating future states from already-close previous states for more realistic dynamics.

Gabriel Guo, Thanawat Sornwanee +5

Architecture Design (Transformers, SSMs, MoE) Computer Vision

Guang Yang +3·3w ago

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

MLLMs can ace circuit-to-code generation by cheating with identifier semantics, even when the circuit diagram is blank.

Guang Yang, Xing Hu, Xiang Chen +1

Code Generation & Program Synthesis Computer Vision Multimodal Models

Ce Chen +8·3w ago·also HeyGen Research

TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

Injecting optical flow into VLMs lets them spot subtle video transitions that other methods miss, opening the door to more robust video understanding.

Ce Chen, Yi Ren, Yuanming Li +6

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Shipeng Liu +4·3w ago

Training-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction

Achieve detailed tunnel defect inspection without any training by visually recalibrating foundation model proposals to overcome tunnel-specific interference.

Shipeng Liu, Liang Zhao +2

Computer Vision Natural Language Processing

Pieter C. Gort +8·3w ago·also Catharina Hospital Eindhoven

Deep Learning-Based Segmentation of Peritoneal Cancer Index Regions from CT Imaging

Automated segmentation of radiological Peritoneal Cancer Index (rPCI) regions from CT scans is now feasible, potentially replacing invasive surgical assessment for peritoneal metastases.

Pieter C. Gort, Lotte J. S. Ewals +6

Robust Lightweight Crack Classification for Real-Time UAV Bridge Inspection

Wei Li +6·3w ago·also Guangdong AIHISUN Technology Co.

You can now get real-time (825 FPS) crack detection on UAVs without sacrificing accuracy, thanks to a new attention-enhanced lightweight CNN.

Wei Li, Haisheng Li, Weijie Li +4

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

Ke Xu·3w ago

WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning

A carefully crafted synthetic data pipeline and rubric-guided RL lets a 4B parameter model nearly match Gemini-3-Flash on wafer defect analysis, suggesting that data quality and targeted training can trump sheer model size.

Ke Xu

Computer Vision Data Curation & Synthetic Data Multimodal Models

Shaanxi Normal University·3w ago

Student Classroom Behavior Recognition Based on Improved YOLOv8s

Overcome the chaos of classroom behavior recognition with ALC-YOLOv8s, achieving state-of-the-art detection of dense, occluded, and imbalanced student actions.

Xiang Gao, Xiangyi Gao, Shuai Hang

Computer Vision

Furkan Kınlı +1·3w ago

Beyond Pixel Fidelity: Minimizing Perceptual Distortion and Color Bias in Night Photography Rendering

Night photography can now look stunningly realistic, thanks to a new rendering technique that beats existing methods on perceptual quality and color accuracy.

Furkan Kınlı, Furkan Kınlı

Architecture Design (Transformers, SSMs, MoE) Computer Vision

Chialoon Cheng +7·3w ago

3D Reconstruction Techniques in the Manufacturing Domain: Applications, Research Opportunities and Use Cases

Despite advances in deep learning, manufacturing-focused 3D reconstruction still struggles with reflective surfaces and dynamic environments, highlighting the need for robust hybrid systems.

Chialoon Cheng, K. Liu, Kaijun Liu +5

Architecture Design (Transformers, SSMs, MoE) Computer Vision Robotics & Embodied AI

3w ago·also UQ

ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss

Existing 3D human mesh recovery systems fall apart for individuals with limb loss, but ResiHMR explicitly reconstructs residual-limb surfaces and performs topology-adaptive optimization, opening the door to more inclusive and accurate human modeling.

Jiaying Ying, Heming Du, Kaihao Zhang +3

Wenxiao Li +9·3w ago

Continuous-tone Simple Points: An $\ell_0$-Norm of Cyclic Gradient for Topology-Preserving Data-Driven Image Segmentation

Guaranteeing topological consistency in image segmentation is now possible within deep learning frameworks thanks to a novel differentiable simple point computation method applicable to continuous-valued images.

Wenxiao Li, Wenxiao Li, Faqiang Wang +7

Meta AI·3w ago·also Oxford

3D-ReGen: A Unified 3D Geometry Regeneration Framework

Controllable 3D generation takes a leap forward with 3D-ReGen, a framework that leverages an initial 3D shape for tasks like enhancement and editing, outperforming existing methods.

Geon Yeong Park, Geon Yeong Park, Roman Shapovalov +8

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Di Shao +15·3w ago·also JIUTIAN Research, NJU

TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On

Ditch the garment masks: a simple human mask is all you need to nail video virtual try-on in the wild.

Di Shao, Dingbao Shao, Songhan Wu +13

Computer Vision Data Curation & Synthetic Data Multimodal Models

Oleg I. Berngardt +3·3w ago

Physically-Informed Fuzzy Clustering of Vertical Sounding Ionograms

Achieve robust ionogram track separation, even under disturbed ionospheric conditions with unknown track numbers, by integrating physical models into fuzzy clustering.

Oleg I. Berngardt, Oleg I.Berngardt, Sergey N. Ponomarchuk +1

Ilyass Moummad +8·3w ago·also CIRAD, INRAE, INRIA, LIRMM +1

Self-Supervised Learning of Plant Image Representations

Seemingly innocuous augmentations like blur can cripple self-supervised learning for fine-grained tasks like plant identification, but domain-aware choices unlock surprisingly strong performance.

Ilyass Moummad, Kawtar Zaher, Hervé Goeau +6

Computer Vision Data Curation & Synthetic Data Training Efficiency & Optimization

Shiqi Xu +5·3w ago

ClimateVID -- Social Media Videos Analysis and Challenges Involved

Despite the promise of VLMs, current models still struggle to grasp the nuances of climate change discourse in social media videos, highlighting the need for more specialized approaches.

Shiqi Xu, Moritz Burmester, Katharina Prasse +3

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Yubo Dong +3·3w ago

RayFormer: Modeling Inter- and Intra-Ray Similarity for NeRF-Based Video Snapshot Compressive Imaging

NeRFs get a boost in video reconstruction quality by explicitly modeling inter- and intra-ray similarities with a novel transformer architecture.

Yubo Dong, Danhua Liu, Anqi Li +1

Architecture Design (Transformers, SSMs, MoE) Computer Vision Training Efficiency & Optimization

Junyi Ma +6·3w ago·also SJTU

Robot Learning from Human Videos: A Survey

Unlock generalist robots by learning manipulation skills directly from the abundance of human activity videos, bypassing the robot data bottleneck.

Junyi Ma, Erhang Zhang, Haoran Yang +4

3w ago

Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining

Initializing prompts in flatter regions of the loss landscape dramatically improves calibration and performance in test-time prompt tuning for vision-language models.

Hyeonseo Jang, Hyeon-Gi Jang, Jaebyeong Jeon +3

Computer Vision Multimodal Models Training Efficiency & Optimization

Ji-Hyeon Kim +2·3w ago

ClipTBP: Clip-Pair based Temporal Boundary Prediction with Boundary-Aware Learning for Moment Retrieval

By explicitly modeling relationships between multiple relevant video segments, ClipTBP significantly improves video moment retrieval, especially when queries are ambiguous.

Ji-Hyeon Kim, Ho-Joong Kim, Seong-Whan Lee

Computer Vision Multimodal Models Recommendation & Information Retrieval

Yuan Fang +6·3w ago

A generalised pre-training strategy for deep learning networks in semantic segmentation of remotely sensed images

Stop wasting compute pre-training on domain-specific datasets; this simple strategy lets you pre-train on ImageNet and still achieve state-of-the-art results on diverse remote sensing segmentation tasks.

Yuan Fang, Yuanzhi Cai, Jagannath Aryal +4

Computer Vision Data Curation & Synthetic Data Training Efficiency & Optimization

Bohai Zhang +11·3w ago

MSR: Hybrid Field Modeling for CT-MRI Rigid-Deformable Registration of the Cervical Spine with an Annotated Dataset

Achieve superior CT-MRI cervical spine registration by adaptively fusing Mamba-based global context with Swin Transformer-based local detail.

Bohai Zhang, Wenjie Chen, Mu Li +9

Computer Vision Data Curation & Synthetic Data Scientific Discovery & Drug Design

Dahua Gao +6·3w ago

FUN: A Focal U-Net Combining Reconstruction and Object Detection for Snapshot Spectral Imaging

Ditch the post-capture processing bottleneck: FUN achieves real-time hyperspectral object detection by jointly learning reconstruction and detection in a single, efficient network.

Dahua Gao, Yubo Dong, Anqi Li +4

Architecture Design (Transformers, SSMs, MoE) Computer Vision

3w ago·also XJTU

Revealing the Impact of Visual Text Style on Attribute-based Descriptions Produced by Large Visual Language Models

LVLMs leak visual text style into semantic inference, meaning the font of a word can change the attributes a model associates with the concept it represents.

Xiaomeng Wang, Martha Larson, Zhengyu Zhao

Jian Lin +7·3w ago

Residual Gaussian Splatting for Ultra Sparse-View CBCT Reconstruction

Achieve high-fidelity CBCT reconstructions from ultra sparse-view data by decoupling geometry and texture in 3D Gaussian Splatting, enabling physically consistent residual detail compensation.

Jian Lin, Jiancheng Fang, Shaoyu Wang +5