100 papers published across 2 labs.
Ditch the feature extraction pipeline: GenMask directly generates segmentation masks with a diffusion transformer, achieving SOTA results by harmonizing mask and image generation in a single model.
Cost volumes might be overkill: WAFT-Stereo proves you can ditch them for a warping-based approach and still dominate stereo matching benchmarks with significantly improved efficiency.
Forget redrawing diagrams by hand: VFIG, a new vision-language model, can automatically convert rasterized figures into editable SVGs with near GPT-5.2 quality.
Forget random back-view hallucinations – Know3D lets you *prompt* the unseen side of 3D models using language, opening the door to controllable 3D asset creation.
Representation-Pivoted Autoencoders enable diffusion models to generate and edit images with higher fidelity by learning a compressed latent space that preserves the semantics of pre-trained visual representations.
Forget generating plausible-but-fake details: 3DreamBooth bakes a robust 3D prior into video generation models using only a single-frame optimization, enabling truly view-consistent customized subject videos.
Even with only 5% labeled data, Switch achieves ultrasound segmentation accuracy exceeding fully supervised methods, thanks to its clever multiscale and frequency-domain switching.
Explicitly reconstructing 3D scenes with Gaussian Splatting unlocks state-of-the-art BEV perception, proving that geometric understanding is key to accurate spatial reasoning.
Fine-tuning a visual geometry transformer with SEAR unlocks surprisingly accurate RGB-Thermal 3D reconstruction, even surpassing SOTA methods despite training on significantly less multimodal data.
Closed-loop feedback using VLMs can dramatically improve text-to-image generation quality, even without additional training.
Linear classification, a cornerstone of machine learning, is provably harder than we thought in high dimensions.
Unlock 4-15% faster Gaussian Splatting without retraining your existing datasets by swapping in a polynomial kernel.
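A quick sketch of the kernel swap, assuming the idea is to replace the per-splat Gaussian falloff with a compactly supported polynomial; the specific form `(1 - r2/cutoff)**n` and its parameters are illustrative guesses, not the paper's kernel.

```python
import numpy as np

def gaussian_kernel(r2):
    """Standard Gaussian falloff in 3DGS rasterization (r2 = squared distance)."""
    return np.exp(-0.5 * r2)

def polynomial_kernel(r2, cutoff=9.0, n=2):
    """Compactly supported polynomial stand-in: no exp() per fragment, and
    exactly zero beyond the cutoff, so those fragments can be culled early."""
    t = 1.0 - r2 / cutoff
    return np.where(t > 0.0, t ** n, 0.0)

# Drop-in at evaluation time: alpha = opacity * kernel(r2). No retraining is
# needed because the splat parameters themselves are untouched.
r2 = np.linspace(0.0, 12.0, 7)
print(gaussian_kernel(r2))
print(polynomial_kernel(r2))
```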
CNNs still reign supreme in Burmese handwritten digit recognition, but physics-inspired PETNNs are hot on their heels, outperforming Transformers and KANs.
Forget waiting a minute for garment generation: SwiftTailor slashes inference times while boosting accuracy by representing 3D garments as geometry images.
Generative videos might look great, but a new metric reveals they often suffer from jarring 3D spatial inconsistencies that existing metrics miss.
Achieve state-of-the-art single image reflection removal by explicitly guiding a diffusion model with spatial intensity and high-frequency priors derived directly from the input image.
Forget brute-force scaling: intelligently selecting just 1% of video frames can actually *improve* video QA accuracy and cut compute by 93%.
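A toy sketch of query-aware frame selection, assuming frames are scored against the question with CLIP-style embeddings; the scoring model and the fixed 1% budget are stand-ins for the paper's selection policy.

```python
import numpy as np

def select_frames(frame_feats, query_feat, budget=0.01):
    """Keep the top `budget` fraction of frames by cosine similarity to the query.

    frame_feats: (T, D) per-frame embeddings; query_feat: (D,) question embedding.
    """
    sims = frame_feats @ query_feat
    sims = sims / (np.linalg.norm(frame_feats, axis=1)
                   * np.linalg.norm(query_feat) + 1e-8)
    k = max(1, int(len(frame_feats) * budget))
    keep = np.argsort(sims)[-k:]        # k most question-relevant frames
    return np.sort(keep)                # restore temporal order for the VLM

T, D = 3000, 512
feats = np.random.randn(T, D).astype(np.float32)
query = np.random.randn(D).astype(np.float32)
print(select_frames(feats, query))     # ~30 of 3000 frames survive the cut
```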
Decomposing uncertainty into aleatoric and epistemic components in image segmentation is often misleading due to substantial entanglement, but ensembles offer a surprisingly robust and less entangled alternative.
Ditch one-hot vectors: representing facial action units as natural language unlocks more realistic and nuanced facial expression synthesis, especially when dealing with conflicting muscle movements.
Scribble prompts beat point prompts for interactive surgical segmentation, achieving state-of-the-art Dice scores with fewer interactions.
Object detectors in new visual domains suffer from "astigmatism," but mimicking the human eye's foveal vision can bring them into focus.
Forget hand-crafted assets and heuristics: V-Dreamer uses video generation models to automatically create diverse, physically plausible robotic simulation environments and trajectories directly from language.
Differentiable collision checking in configuration space, previously a major hurdle, is now achievable with zero-shot generalization thanks to CSSDF-Net.
Instruction-guided video editing can achieve impressive zero-shot performance simply by pre-training on motion-centric video restoration tasks *before* fine-tuning on paired editing data.
Achieve more physically realistic video generation by explicitly modeling 3D geometry and physical attributes across multiple viewpoints.
You can predict viewers' engagement with and attraction to a video lecture just by analyzing the speaker's face and voice, with no audience data needed.
VLMs can now better detect when they're seeing something they shouldn't, even as the world changes around them, thanks to a new method that dynamically fuses visual and textual cues.
Current video object removal methods leave distracting visual artifacts behind, but EffectErase tackles this problem head-on by jointly removing objects and their pesky visual effects.
Get faithful and plausible natural language explanations for chest X-rays with as few as 5 human-annotated examples per diagnosis, and even boost classification accuracy in the process.
Unlock real-time 3D understanding: MonoArt achieves state-of-the-art monocular articulated object reconstruction without relying on multi-view data or external motion templates.
Achieve 9x lower trajectory error and 3x better FID in motion generation by using a diffusion-based discrete motion tokenizer that elegantly handles both semantic and kinematic constraints.
VLMs struggle with spatial reasoning, but a clever decomposition into sub-problems and probabilistic recombination unlocks significantly better metric-semantic grounding.
Unlocking fairer vision-language models may be as simple as intervening in the sparse latent space of a sparse autoencoder, enabling targeted bias removal without harming performance.
Get continuous level-of-detail rendering in 3D Gaussian Splatting without sacrificing top-end quality – no architectural changes needed.
Autonomous driving models can be made significantly more robust and safe by explicitly de-confounding their training via causal intervention, eliminating reliance on spurious correlations.
Forget generic textures – CustomTex lets you clone real-world object appearances onto your 3D scenes with uncanny fidelity.
Proactive VideoLLMs can finally be both accurate AND efficient thanks to a novel propose-match framework that decouples semantic understanding from streaming perception.
Encoding realism as a knowledge graph of interpretable traits unlocks zero-shot sim2real image translation that outperforms state-of-the-art diffusion methods.
Ditch the handcrafted noise schedules: spectral analysis unlocks per-image diffusion schedules that boost generative quality, especially when you're racing against the clock with few steps.
Autoregressive generative classifiers can beat diffusion models at image classification, but only if you marginalize over token order.
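A sketch of what order marginalization means here, assuming a class-conditional autoregressive model over image tokens; `log_prob_given_order` is a hypothetical hook into such a model, not an API from the paper.

```python
import numpy as np

def logmeanexp(lls):
    """Numerically stable log of the mean of exp(lls)."""
    lls = np.asarray(lls)
    m = lls.max()
    return m + np.log(np.mean(np.exp(lls - m)))

def classify(tokens, classes, log_prob_given_order, n_orders=8, seed=0):
    """Order-marginalized generative classification: estimate
    log p(x | y) = log E_order[ p(x | y, order) ] by Monte Carlo over random
    token permutations, then pick the argmax class."""
    rng = np.random.default_rng(seed)
    scores = {}
    for y in classes:
        lls = [log_prob_given_order(tokens, y, rng.permutation(len(tokens)))
               for _ in range(n_orders)]
        scores[y] = logmeanexp(lls)
    return max(scores, key=scores.get)

# Toy usage with a dummy "model" that assigns class 1 the highest likelihood:
dummy = lambda toks, y, order: -abs(y - 1) * len(toks)
print(classify(np.arange(16), classes=[0, 1, 2], log_prob_given_order=dummy))  # -> 1
```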
A new dataset and model specifically designed for traffic anomaly understanding in roundabouts could pave the way for more robust and efficient intelligent transportation systems.
Simpler fingerprint enhancement techniques can outperform complex state-of-the-art methods, especially on low-quality images.
Achieve state-of-the-art panoramic depth estimation without any training by cleverly exploiting the 3D consistency priors embedded within existing vision foundation models.
Unsupervised contrastive learning can now outperform supervised methods for 3D shape matching, while simultaneously slashing computational costs.
Text-to-image synthesis just got almost 4x faster without sacrificing image quality, thanks to a clever twist on Speculative Jacobi Decoding that keeps the generation process moving even when initial drafts are rejected.
Achieve topologically-aware image segmentation without cumbersome architectures or expensive computations: SCNP makes it easy.
Compact ViTs can now rival or surpass CNN-based architectures like YOLO for edge-based object detection, instance segmentation, and pose estimation, thanks to task-specialized distillation.
Ditch the training: SVOO achieves up to 1.93x speedup in video generation with sparse attention by exploiting the intrinsic, layer-specific sparsity patterns of attention without any fine-tuning.
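A minimal sketch of training-free top-k sparse attention with a per-layer keep ratio; the ratio values, and how SVOO actually profiles each layer's sparsity pattern, are assumptions here.

```python
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, keep_ratio):
    """Keep only the strongest `keep_ratio` fraction of attention logits per
    query and mask the rest before softmax. Training-free: the kept logits are
    exactly the dense model's logits. (A real speedup needs block-sparse
    kernels; this dense masking only illustrates the math.)"""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5     # (B, H, Tq, Tk)
    n_keep = max(1, int(scores.shape[-1] * keep_ratio))
    thresh = scores.topk(n_keep, dim=-1).values[..., -1:]      # per-row k-th value
    scores = scores.masked_fill(scores < thresh, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 256, 64)
out = sparse_attention(q, k, v, keep_ratio=0.1)  # e.g. a highly sparse layer
print(out.shape)
```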
CNNs still reign supreme for medical image segmentation on heterogeneous datasets, beating out hybrid transformer models despite the latter's theoretical advantages.
Deep learning can rescue visual-inertial odometry (VIO) from textureless environments and rapid lighting changes.
Automating the motor insurance pipeline, from vehicle damage analysis to claims evaluation, is now possible with a vertically integrated AI paradigm.
DriveTok achieves unified multi-view reconstruction and understanding by learning scene tokens that integrate semantic, geometric, and textural information, outperforming existing 2D tokenizers in autonomous driving scenarios.
State Space Models can outperform Vision Transformers as vision encoders in VLMs, particularly when model size is a constraint.
Achieve atomic-scale clarity in noisy HRTEM images with a novel denoising network that intelligently exploits statistical characteristics in both spatial and frequency domains.
Diffusion models can now generate rare concepts and execute complex edits with greater fidelity, thanks to a training-free prompt blending technique that leverages statistical properties of the diffusion process itself.
Ditch the finetuning: this training-free method uses attention scores to generate rare concepts in images with more precision and control than LLM-guided approaches.
DROID-SLAM achieves robust real-time RGB SLAM in dynamic environments by explicitly modeling per-pixel uncertainty, outperforming existing methods that struggle with unknown dynamic objects and cluttered scenes.
Even with malicious clients flipping labels, FedTrident recovers federated learning performance to near attack-free levels, outperforming existing defenses by up to 9.49% in critical metrics.
Aligning diffusion models with just 100 carefully selected samples can beat state-of-the-art preference optimization methods trained on thousands, and converge up to 220x faster.
Achieve near-perfect radio map reconstruction (SSIM 0.9752, PSNR 36.37 dB) from limited data by injecting electromagnetic theory into diffusion models.
You can get state-of-the-art performance on retinal fundus image tasks with an interpretable foundation model that's 16x smaller than the alternatives.
Over-reliance on neighborhood similarity in source-free domain adaptation hurts performance; ProCal offers a way to dynamically calibrate predictions and improve generalization.
MRI reconstruction can be made dramatically more robust to clinical domain shifts by eliminating the need for explicit coil sensitivity map estimation.
Achieve topologically coherent coronary vessel segmentation by directly optimizing for geometric structure, rather than pixel-wise accuracy, using preference-based learning.
Real-time robotic perception just got a major upgrade: OnlinePG achieves open-vocabulary panoptic mapping with 3D Gaussian Splatting, enabling robots to understand and interact with environments in a way that was previously impossible.
Synthesized PET scans from MRI can nearly match the diagnostic accuracy of real PET for Alzheimer's, potentially unlocking wider access to crucial functional insights.
Visual language models can now explicitly reason about object trajectories in videos, thanks to a simple yet effective method that augments training data and uses discrete motion tags.
LVLMs can gain a surprising amount of spatial reasoning ability by explicitly generating segmentation and depth tokens before answering questions.
LLMs can navigate more efficiently in unfamiliar environments by reasoning over a tree of possible paths, not just isolated waypoints, enabling them to consider en-route information gain and prune unpromising branches.
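A compact sketch of reasoning over a path tree rather than isolated waypoints; the `info_gain` hook is a hypothetical stand-in for whatever LLM-based estimate the paper uses, and branch pruning is omitted for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class PathNode:
    waypoint: str
    children: list = field(default_factory=list)

def best_path(node, info_gain, depth=3):
    """Depth-limited search over the path tree: score each branch by its
    cumulative en-route information gain and keep the best one."""
    if depth == 0 or not node.children:
        return info_gain(node.waypoint), [node.waypoint]
    best_score, best = float("-inf"), None
    for child in node.children:
        score, path = best_path(child, info_gain, depth - 1)
        if score > best_score:
            best_score, best = score, path
    return info_gain(node.waypoint) + best_score, [node.waypoint] + best

tree = PathNode("start", [PathNode("hallway", [PathNode("kitchen")]),
                          PathNode("stairs", [PathNode("bedroom")])])
gain = {"start": 0, "hallway": 2, "kitchen": 3, "stairs": 1, "bedroom": 5}.get
print(best_path(tree, gain))   # picks start -> stairs -> bedroom (gain 6)
```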
Radiometric disentanglement from a single image becomes tractable by exploiting the shared illumination constraint across multiple objects, enabling stochastic sampling of reflectance, texture, and illumination.
Detecting subtle building changes gets a boost: a new RGB-NIR dataset and network reveal the power of multi-modal fusion for teasing out fine-grained differences.
Ditch the mask decoder: a single segmentation token can unlock competitive image segmentation directly from MLLMs.
Reconstructing realistic hand-object interactions from video just got an order of magnitude faster, thanks to a novel Gaussian Splatting approach that ensures physical consistency.
Pixel-perfect geospatial reasoning is now possible, thanks to a vision-language model that adaptively fuses multi-modal and multi-temporal Earth observation data.
Diffusion models can generate segmentations that rival discriminative methods, but only if you reshape their vector fields with a distance-aware correction term that combats gradient vanishing.
Representing complex 3D biomedical graphs as learned tokens unlocks generative modeling and efficient analysis of anatomical structures.
Overcoming occlusion in hand-object pose estimation just got easier: GenHOI leverages hierarchical semantic knowledge and hand priors to achieve state-of-the-art results on challenging benchmarks.
Get GPT-4o-level long-video QA performance with 10x fewer FLOPs by using a hierarchical, training-free frame selector that combines multimodal experts and fuzzy logic.
End-to-end quantum image generation is now possible, even with limited qubits, thanks to a new method that bridges the gap between quantum circuits and pixel intensities.
Hybrid LiDAR-inertial-visual odometry (LIVO) robustly handles visually challenging conditions, outperforming sparse-direct methods by combining direct photometric methods with learning-based feature descriptors.
Smaller open-source models can outperform larger proprietary LVLMs on specific authenticity cues in AI-generated video detection, challenging the assumption that scale alone guarantees better performance.
By combining CNNs and State Space Models, DA-Mamba achieves efficient global-local feature alignment for domain adaptive object detection, outperforming prior CNN-only and Transformer-based approaches.
Achieve state-of-the-art joint audio-video generation with fewer resources by fixing key flaws in cross-modal context handling within dual-stream transformers.
Restoring faces across age gaps is now possible: MeInTime leverages diffusion models and age-aware guidance to create faithful restorations from cross-age references.
Token compression and multi-agent systems are enabling more efficient and interpretable multimodal reasoning in computational pathology, paving the way for trustworthy AI-assisted diagnosis.
High-dimensional discrete tokens, previously out of reach for generative models, can now be directly generated, unlocking a unified token prediction paradigm for multimodal architectures.
Text-to-3D generation gets a semantic upgrade: DreamPartGen creates 3D objects with parts that not only look right but also understand their relationships and align with textual descriptions.
Schrödinger Bridges elegantly unify diffusion models, score-based models, and flow matching under a single, powerful framework.
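For readers wanting the formal anchor, this is the dynamic Schrödinger Bridge problem in its textbook form (standard notation, not necessarily the paper's); score-based diffusion drops out as a half-bridge special case.

```latex
% Among all path measures P with the required endpoint marginals, take the
% one closest in KL to a reference diffusion Q:
\min_{P}\ \mathrm{KL}(P \,\|\, Q)
\quad\text{s.t.}\quad P_0 = \pi_{\mathrm{data}},\ \ P_1 = \pi_{\mathrm{prior}}.
% With reference dX_t = \sqrt{\beta_t}\,dW_t, the optimum is a forward/backward
% SDE pair driven by two potentials \Psi, \widehat{\Psi}:
dX_t = \beta_t \nabla\log\Psi_t(X_t)\,dt + \sqrt{\beta_t}\,dW_t,
\qquad
dX_t = -\beta_t \nabla\log\widehat{\Psi}_t(X_t)\,dt + \sqrt{\beta_t}\,d\bar{W}_t,
% linked by \nabla\log p_t = \nabla\log\Psi_t + \nabla\log\widehat{\Psi}_t.
% Score-based diffusion is the half-bridge case \Psi_t \equiv 1, where the
% backward drift reduces to -\beta_t \nabla\log p_t, and the associated
% probability-flow ODE gives the flow-matching view.
```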
Spatial awareness is the secret ingredient to unlocking better visual in-context learning, boosting performance across diverse vision tasks.
The chaos of multivariate time-series anomaly detection (MTSAD) research gets a little tamer with a new taxonomy that exposes the field's hidden convergence on Transformers and reconstruction, hinting at where the next breakthroughs will come from.
Ditch the slow per-scene optimization: SwiftGS meta-learns transferable priors for satellite surface reconstruction, enabling single-pass 3D recovery.
MLLMs can gain surprisingly strong 3D spatial reasoning abilities simply by tapping into the latent knowledge already present in video generation models.
Color image restoration gets a boost: exploiting saturation-value similarity in nonlocal methods yields significantly better results than relying on individual RGB channels.
Keyword-based concept unlearning is brittle: representing visual concepts with diverse prompts yields stronger erasure, better retention, and improved robustness against adversarial attacks.
Unlock geometry-precise 3D generation by directly conditioning diffusion models on readily available point cloud priors, outperforming existing image- or text-conditioned methods.
Dramatically speed up histopathology super-resolution by adaptively routing image tiles through a flow-matching network, achieving near-lossless quality at a fraction of the compute.
Injecting "historical attention" into vision transformers boosts accuracy by over 1% with minimal architectural changes, suggesting that current ViTs underutilize information learned in earlier layers.
Video diffusion models can be aggressively quantized down to 6-bit precision with minimal quality loss by dynamically adapting the bit-width of each layer based on its temporal stability.
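A toy sketch of the allocation idea, assuming temporal stability is measured as activation variance across denoising steps; the thresholds and the bit menu (a mix averaging around 6 bits) are illustrative, not the paper's calibration procedure.

```python
import numpy as np

def assign_bitwidths(activations_per_step, bits=(4, 6, 8)):
    """Pick a per-layer bit-width from each layer's temporal stability.

    activations_per_step: dict layer_name -> (T, N) array of activations over
    T denoising steps. Stable layers (low variance across steps) tolerate
    aggressive quantization; volatile layers keep more bits.
    """
    out = {}
    for name, acts in activations_per_step.items():
        instability = acts.std(axis=0).mean()       # mean per-unit std over steps
        if instability < 0.1:
            out[name] = bits[0]
        elif instability < 0.5:
            out[name] = bits[1]
        else:
            out[name] = bits[2]
    return out

rng = np.random.default_rng(0)
layers = {"block0": rng.normal(0, 0.05, (50, 128)),   # very stable over steps
          "block1": rng.normal(0, 1.0, (50, 128))}    # volatile
print(assign_bitwidths(layers))                        # {'block0': 4, 'block1': 8}
```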
Agents can now "hallucinate" optimal viewpoints for reasoning by storing and re-rendering scenes with 3D Gaussian Splatting, enabling recovery from initial observation failures.
Medical vision-language models are surprisingly brittle: clinically plausible image manipulations, like those introduced during routine acquisition and delivery, can drastically degrade their performance.