Search papers, labs, and topics across Lattice.
100 papers published across 1 lab.
Forget training data – Extend3D generates impressive town-scale 3D scenes from a single image by cleverly extending and patching the latent space of an object-centric 3D generative model.
By tightly coupling reasoning, searching, and generation, Unify-Agent achieves state-of-the-art world-grounded image synthesis, rivaling closed-source models and opening new avenues for agent-based multimodal generation.
Cut your 3D-QA model's token budget by 91% and latency by 86% with a new pruning method that intelligently balances semantic importance and geometric coverage.
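The blurb doesn't spell out the scoring rule, but a pruning objective of this kind can be sketched as a greedy trade-off between per-token semantic scores and farthest-point geometric coverage. A minimal Python sketch; all names, the 0.5 weighting, and the seeding choice are assumptions, not the paper's algorithm:

```python
import numpy as np

def prune_tokens(features, xyz, scores, keep=0.09, alpha=0.5):
    """Keep a fraction of 3D tokens, greedily balancing semantic score
    with geometric coverage (farthest-point criterion). Hypothetical
    sketch, not the paper's actual method."""
    n = len(scores)
    k = max(1, int(keep * n))
    kept = [int(np.argmax(scores))]           # seed with the most salient token
    dist = np.linalg.norm(xyz - xyz[kept[0]], axis=1)
    for _ in range(k - 1):
        # combined objective: semantic importance + distance from the kept set
        combined = alpha * scores + (1 - alpha) * dist / (dist.max() + 1e-8)
        combined[kept] = -np.inf              # never re-select a kept token
        nxt = int(np.argmax(combined))
        kept.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[nxt], axis=1))
    return features[kept], xyz[kept]

# toy usage: 1000 tokens with 32-dim features and 3D coordinates
feats, pts = np.random.randn(1000, 32), np.random.rand(1000, 3)
sal = np.random.rand(1000)                    # e.g. attention-derived saliency
kept_feats, kept_pts = prune_tokens(feats, pts, sal, keep=0.09)
print(kept_feats.shape)                       # (90, 32) -> ~91% fewer tokens
```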
Adding MRI data to histopathology and gene expression modestly improves glioma survival prediction, but only when combined effectively in a trimodal deep learning model.
Achieve superior compression of wind turbine images without sacrificing defect detection accuracy by using a segmentation-guided, dual lossy/lossless compression scheme.
Forget privacy concerns: you can train high-performing deep learning models for dynamic MRI reconstruction using *synthetic* fractal data.
Achieve real-time, privacy-aware action detection on edge devices by intelligently fusing fast skeleton tracking with vision-language models, outperforming either approach alone.
Current vision-language models are surprisingly bad at identifying common household safety hazards, but a new benchmark could change that.
Forget Fitzpatrick scores: lesion-skin contrast is the real culprit behind skin lesion segmentation errors, not overall skin tone.
Image generation models can now achieve state-of-the-art fidelity with up to 64x fewer tokens, thanks to a novel masking strategy that prevents latent space collapse.
Pose-guided GANs and diffusion models can faithfully generate complex cultural dance postures, opening new avenues for digital preservation and education.
Run multiple LoRA-tuned GenAI models on your phone without blowing up storage or latency: just swap weights at runtime.
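Runtime adapter swapping of this kind can be realized by keeping one frozen base model resident and moving only the rank-sized low-rank factors per app. A hedged sketch; class and method names are hypothetical, not any specific mobile runtime's API:

```python
import torch

class SwappableLoRALinear(torch.nn.Module):
    """Linear layer whose LoRA delta (B @ A) can be swapped at runtime,
    so many adapters share one set of frozen base weights."""
    def __init__(self, base: torch.nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # base stays frozen and shared
        self.lora_A = None                    # (rank, in_features)
        self.lora_B = None                    # (out_features, rank)

    def load_adapter(self, A: torch.Tensor, B: torch.Tensor):
        self.lora_A, self.lora_B = A, B       # cheap: only rank-sized tensors move

    def forward(self, x):
        y = self.base(x)
        if self.lora_A is not None:
            y = y + (x @ self.lora_A.T) @ self.lora_B.T
        return y

layer = SwappableLoRALinear(torch.nn.Linear(512, 512))
rank = 8
layer.load_adapter(torch.randn(rank, 512), torch.zeros(512, rank))
out = layer(torch.randn(1, 512))              # swap adapters without reloading the base
```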
Forget tedious poster design – iPoster lets you sketch your vision and then uses a smart diffusion model to instantly generate polished, content-aware layouts that respect your constraints.
Forget fine-tuning: this HTR model adapts to new handwriting styles in just a few shots, *without* any parameter updates.
Overcoming the challenge of limited and inconsistent imaging criteria for perineural invasion (PNI) diagnosis, NeoNet achieves state-of-the-art prediction accuracy by generating synthetic training data with a 3D Latent Diffusion Model.
Adversarial training doesn't have to destroy VLMs' zero-shot abilities: aligning adversarial visual features with textual embeddings using the original model's probabilistic predictions can actually *improve* robustness.
Robots can now generalize to unseen objects and categories for manipulation tasks with only a few training examples, thanks to a novel retrieval-augmented affordance prediction framework.
AI-generated image forgery detection gets a major boost with PromptForge-350k, a dataset so large and well-annotated it pushes IoU scores 5% higher and generalizes to unseen models.
Quantum-inspired architectures can significantly improve 3D cloud forecasting by better capturing nonlocal dependencies, outperforming classical methods like ConvLSTM and Transformers.
Correcting a vision-language model's "hallucinations" is now as simple as pinpointing and editing the right intermediate representation, sidestepping costly retraining or dual inference.
Federated learning systems are far more vulnerable to backdoor attacks than prior evaluations built on simple corner-patch triggers suggested, once attackers use realistic, semantically aligned triggers like sunglasses.
Robots can now learn to reproduce oil paintings with impressive accuracy through self-play and learned dynamics, even without human demonstrations or high-fidelity simulators.
Diffusion-based denoising can significantly improve composed image retrieval by making similarity scores more robust to hard negative samples.
Throw out your full images: focusing on pathology-relevant visual patches dramatically outperforms using the entire image for radiology report summarization.
Radiology report generation models can now verbalize calibrated confidence estimates, enabling targeted radiologist review of potentially hallucinated findings.
Diffusion-based watermarks, thought to be secure, can be completely bypassed with a simple stochastic resampling trick that breaks trajectory reconstruction.
Open-source SurgNavAR slashes the barrier to entry for AR surgical navigation research, offering a ready-to-use framework adaptable to diverse surgical applications.
Polarization cues, often overlooked, can significantly boost camouflaged object detection by explicitly guiding RGB feature learning, leading to state-of-the-art performance.
GPT-5 can only solve 37% of PhD-level 3D geometry coding problems, suggesting AI can't reliably automate complex scientific coding tasks yet.
Synthetic data, when carefully aligned with real-world characteristics, can boost hand-object interaction detection by over 11% even when real labeled data is scarce.
Vision-language models falter at the fine-grained temporal recognition crucial for surgical video understanding, while SurgRec excels.
Surgical VQA gets a major upgrade: SurgTEMP's hierarchical visual memory and competency-based training leapfrog existing models in understanding complex, time-sensitive surgical procedures.
By separating known and unknown object representations into orthogonal subspaces, DEUS achieves state-of-the-art open world object detection, outperforming prior methods that struggle to learn distinct unknown object representations.
Simply averaging pixel-level uncertainty in image segmentation throws away crucial spatial information, leading to worse performance on downstream tasks like detecting when your model is likely to fail.
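For intuition on why the mean is spatially blind: two uncertainty maps can share a comparable average while one concentrates all of its uncertainty in a single region. A toy illustration; the patch-wise summary below is one assumed alternative, not the paper's aggregator:

```python
import numpy as np

def mean_uncertainty(u):
    return float(u.mean())                    # spatially blind summary

def patch_uncertainty(u, patch=32):
    """Max over patch means: flags one concentrated uncertain region
    that global averaging washes out. Illustrative only."""
    h, w = u.shape
    means = [u[i:i + patch, j:j + patch].mean()
             for i in range(0, h - patch + 1, patch)
             for j in range(0, w - patch + 1, patch)]
    return float(max(means))

rng = np.random.default_rng(0)
scattered = rng.random((128, 128)) * 0.2                     # diffuse low uncertainty
clustered = np.zeros((128, 128)); clustered[:32, :32] = 1.0  # one failing region
print(mean_uncertainty(scattered), mean_uncertainty(clustered))   # comparable means
print(patch_uncertainty(scattered), patch_uncertainty(clustered)) # very different
```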
Forget uncanny-valley characters: Gloria lets you create consistent, expressive digital characters in videos exceeding 10 minutes, a leap towards believable virtual actors.
Diffusion-based feature denoising can significantly bolster the robustness of handwritten digit classifiers against adversarial attacks, even outperforming standard CNNs.
YOLOv11 crushes the competition in form element detection, showcasing its potential for automating document processing across diverse real-world forms.
Achieve fine-grained, six-degrees-of-freedom camera control in dynamic scenes with a generalizable model that outperforms scene-specific and diffusion-based approaches.
Single-pixel imaging gets a deep learning boost: SISTA-Net leverages learned sparsity and hybrid CNN-VSSM architectures to achieve state-of-the-art reconstruction quality, even in noisy underwater environments.
Fusing low-level statistical anomalies, high-level semantic coherence, and mid-level texture patterns makes AI-generated image detection far more reliable across diverse generative models.
Achieve massive gains in few-shot hierarchical multi-label classification (+42%) by adaptively balancing semantic priors and visual evidence using level-aware embeddings.
Stop training your image restoration models to mimic flawed ground truth; instead, explicitly optimize for perceptual quality using a plug-and-play module guided by No-Reference Image Quality Assessment.
Current facial expression editing models can't simultaneously preserve identity and accurately manipulate expressions, revealing a critical need for better fine-grained instruction following.
Video Transformers can achieve near-full attention accuracy with significantly less compute by focusing only on informative vertical vectors.
By injecting LLM-derived contextual cues into skeleton representations, SkeletonContext achieves state-of-the-art zero-shot action recognition, even distinguishing visually similar actions without explicit object interactions.
Forget expensive labels: CoRe-DA leverages contrastive learning and self-training to achieve state-of-the-art surgical skill assessment across diverse surgical environments without requiring target domain annotations.
Radio astronomy-aware self-supervised pre-training beats out-of-the-box Vision Transformers for transfer learning on radio astronomy morphology tasks.
Masked motion generators struggle with complex movements because they treat all frames the same – until now.
Edge cameras can achieve a 45% improvement in cross-modal retrieval accuracy by ditching redundant frames and focusing only on what's new.
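A minimal sketch of the novelty-based filtering idea: keep a frame only when it differs enough from the last kept frame. The pixel-difference metric and threshold are illustrative assumptions, not the paper's criterion:

```python
import numpy as np

def novel_frames(frames, thresh=0.1):
    """Drop near-duplicate frames: keep frame t only if its mean pixel
    distance to the last kept frame exceeds `thresh`. Illustrative."""
    kept = [0]
    for t in range(1, len(frames)):
        diff = np.abs(frames[t] - frames[kept[-1]]).mean()
        if diff > thresh:
            kept.append(t)
    return kept

video = np.random.rand(300, 64, 64)           # stand-in for a decoded clip
video[100:200] = video[100]                   # a static, redundant segment
print(len(novel_frames(video)))               # the static frames are skipped
```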
VLMs struggle with Earth observation tasks involving complex land use, but a new dataset with nearly 10 million text annotations could change that.
FlowID enables forensic facial reconstruction on damaged faces with better identity preservation and lower computational cost than existing methods, potentially accelerating victim identification in violent deaths.
Diffusion models can beat discriminative classifiers at facial expression recognition, but only with a dynamically adjusted margin loss that accounts for per-sample difficulty.
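One way such a loss can look: shrink the margin for hard samples so they aren't over-penalized. A hedged sketch; the schedule and names are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def adaptive_margin_loss(logits, targets, base_margin=0.3):
    """Subtract a per-sample margin from the true-class logit before
    cross-entropy; the margin shrinks for hard (low-confidence) samples.
    Illustrative sketch only."""
    with torch.no_grad():
        p_true = F.softmax(logits, dim=1).gather(1, targets[:, None]).squeeze(1)
        margin = base_margin * p_true         # easy sample -> bigger margin
    adjusted = logits.clone()
    adjusted[torch.arange(len(targets)), targets] -= margin
    return F.cross_entropy(adjusted, targets)

logits = torch.randn(8, 7, requires_grad=True)   # 7 basic expressions
targets = torch.randint(0, 7, (8,))
adaptive_margin_loss(logits, targets).backward()
```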
Nighttime image dehazing gets a boost from a structure-texture decomposition that enhances details and corrects color biases in the YUV color space.
Surgeons can now pinpoint tumor margins with millimeter precision using augmented reality, potentially reducing positive margins in head and neck cancer resections.
Square superpixels, generated via granular ball computing, unlock efficient parallel processing and end-to-end optimization in deep learning pipelines by replacing irregular shapes with multi-scale square blocks.
Querying satellite imagery just got easier: EarthEmbeddingExplorer lets you find images using text, visuals, or location, unlocking insights previously trapped in research papers.
A training-free feature adjustment pipeline unlocks the power of Visual Geometry Grounded Transformers for stereo vision, achieving state-of-the-art results on KITTI.
Turn semantic segmentation into hyperspectral unmixing with a surprisingly simple pipeline that leverages polyhedral-cone partitioning, outperforming existing deep and non-deep methods.
Rendering artifacts in feed-forward 3D Gaussian Splatting? Solved: AA-Splat delivers a whopping 7 dB PSNR boost by fixing screen-space dilation filters.
Finally, a blind face restoration method that doesn't just hallucinate details, but lets you precisely control facial attributes via text prompts while maintaining high fidelity.
Multimodal models surprisingly falter when applied to presentation attack detection on ID documents, challenging the assumption that combining visual and textual data inherently improves security.
Ditching depth map projections for camera-LiDAR calibration unlocks significant gains in accuracy and robustness, especially when starting from poor initial extrinsic estimates.
Quantifying and integrating map uncertainty—both positional and semantic—into trajectory prediction pipelines significantly boosts forecast accuracy, even when using existing baseline models.
Achieve a 60% reduction in trajectory error for monocular SLAM by tightly integrating multi-task dense prediction with a compact perception-to-mapping interface.
Reconstructing dynamic 3D scenes from video just got a whole lot better: MotionScale achieves state-of-the-art fidelity and temporal stability by scaling Gaussian splatting to long, complex sequences.
Gaze, often overlooked, reveals deepfake origins with surprising accuracy, enabling a new CLIP-based approach that significantly boosts deepfake attribution and detection.
Stop segmenting remote sensing images in isolation: modeling inter-unit dependencies boosts open-vocabulary segmentation accuracy by up to 6%.
Negation, a known weakness in VLMs like CLIP, can be dramatically improved by strategically fine-tuning only the *front* layers of the text encoder with a modified contrastive loss.
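Restricting updates to the front text-encoder blocks is easy to express; a sketch against a CLIP checkpoint via the transformers library (the four-layer split point is an illustrative assumption, and the modified contrastive loss itself is omitted):

```python
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
for p in model.parameters():
    p.requires_grad_(False)                   # freeze the whole model

# unfreeze only the *front* text-encoder layers (split point is illustrative)
for block in model.text_model.encoder.layers[:4]:
    for p in block.parameters():
        p.requires_grad_(True)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")     # a small slice of the model
```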
Forget blurry averages – DMA unlocks sharp, realistic concept prototypes directly within diffusion models, offering a new lens into model understanding and bias.
Forget expensive training: FlexMem unlocks SOTA long-video MLLM performance on a single GPU by cleverly mimicking human memory recall.
Publicly available satellite imagery can now estimate building heights with state-of-the-art accuracy thanks to a new dataset and network architecture designed for the task.
By explicitly modeling camouflage and distractors, CCDNet achieves state-of-the-art infrared small target detection, even in challenging environments where targets blend into the background.
Forget tedious optimization – LightHarmony3D generates realistic lighting and shadows for inserted 3D objects in a single pass, making scene augmentation feel truly real.
A novel data-dependency-free palette unlocks high-throughput, low-resource mezzanine coding, outperforming JPEG-XS while halving LUT resource usage.
Diffusion-based image editing's impressive flexibility comes with fundamental trade-offs between controllability, faithfulness, consistency, locality, and quality, which this paper exposes with clear theoretical bounds.
Turn 2D orthographic views into 3D models automatically using corner detection and geometric reconstruction.
Current text-to-long-video evaluation metrics can't reliably assess video quality, failing to match human judgment in 9 out of 10 tested degradation aspects.
You can halve the polygon count of dynamic 3D meshes in VR without users noticing, but existing quality metrics won't tell you that.
Humanoids can now nimbly navigate real-world clutter and complex terrain using only raw depth data, ditching hand-engineered geometric representations.
Forget brute-force coverage – this method learns from simulated expert guidance to prioritize semantically relevant areas, dramatically speeding up target search in unseen environments.
Automating disassembly of complex, degraded appliances in recycling plants is now feasible, achieving high accuracy without pre-programmed coordinates.
SuperGrasp achieves robust single-view grasping by cleverly combining superquadric-based similarity matching with an end-to-end refinement network, outperforming existing methods in stability and generalization.
Real-time, uncertainty-aware signed distance functions are now possible without sacrificing accuracy, thanks to a novel kernel regression and GP regression hybrid.
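For intuition, here is plain GP regression on noisy signed-distance samples, the uncertainty-bearing half of such a hybrid; the fast kernel-regression component and all real-time machinery are omitted, and the toy 1D problem is an assumption:

```python
import numpy as np

def rbf(a, b, ls=0.2):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

# noisy signed-distance observations for a toy 1D "surface" at x = 0.3
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (40, 1))
y = X[:, 0] - 0.3 + 0.02 * rng.standard_normal(40)   # true SDF: x - 0.3

Xq = np.linspace(-1, 1, 5)[:, None]                  # query points
K = rbf(X, X) + 1e-4 * np.eye(len(X))                # kernel + noise jitter
Kq = rbf(Xq, X)
mean = Kq @ np.linalg.solve(K, y)                    # GP posterior mean
var = rbf(Xq, Xq).diagonal() - np.einsum(
    "ij,ji->i", Kq, np.linalg.solve(K, Kq.T))        # GP posterior variance
print(np.round(mean, 2), np.round(var, 4))           # SDF estimate + uncertainty
```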
Policies trained with GenSplat maintain robust performance under severe spatial perturbations where baseline methods completely fail, thanks to its novel 3D Gaussian Splatting-based augmentation.
World models can achieve state-of-the-art video prediction and emergent object decomposition by combining object-centric slots, hierarchical temporal dynamics, and learned causal interaction graphs.
Turn monaural video into immersive binaural audio with SIREN, a visually-guided framework that learns spatial audio cues without task-specific annotations.
Giving VLMs access to basic image manipulation tools and a strategic routing system dramatically improves their ability to "see through" visual illusions, even generalizing to unseen illusion types.
Over half of video understanding benchmark samples are solvable without watching the video, and current models barely outperform random guessing on the rest.
Style transfer can now capture the essence of artistic abstraction, not just surface-level appearance, by explicitly reinterpreting image structure.
Finally, a video generation model lets you roam through a scene with long-term spatial and temporal consistency, opening up new possibilities for virtual exploration.
Unbalanced class prevalence, not just disjoint label sets, is the dominant factor hindering federated learning performance under label-space heterogeneity.
Existing object detection models stumble when faced with the morphological diversity of cells in high-resolution, whole-brain microscopy data, revealing a critical gap in their generalization ability.
Current multimodal LLMs struggle to count objects and ground evidence in videos longer than 30 minutes, achieving only ~25% accuracy on a new benchmark, far below human performance.
Video diffusion models lock in their high-level plan almost immediately, suggesting a new path to scaling their reasoning abilities by focusing compute on promising early trajectories.
Unleashing creative potential in text-to-image models just got easier: on-the-fly repulsion in the contextual space lets you steer diffusion transformers towards richer diversity without sacrificing image quality or blowing your compute budget.
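A repulsion step of this general shape can be sketched as a gradient nudge that decorrelates a batch of latents; purely illustrative, with the cosine-similarity objective and step size as assumptions rather than the paper's formulation:

```python
import torch

def repulsion(latents, strength=0.1):
    """Push a batch of sampled latents apart to encourage diversity:
    one gradient step against mean pairwise similarity. Illustrative
    sketch; `latents` must require grad."""
    z = latents.flatten(1)
    sim = torch.nn.functional.cosine_similarity(z[:, None], z[None, :], dim=-1)
    loss = (sim - torch.eye(len(z))).mean()   # ignore self-similarity
    grad = torch.autograd.grad(loss, latents)[0]
    return latents - strength * grad          # one repulsive nudge

lat = torch.randn(4, 16, requires_grad=True)  # 4 parallel samples
print(repulsion(lat).shape)                   # (4, 16), nudged apart
```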
Generate or edit 1024x1024 images on your phone in under a second with DreamLite, a unified diffusion model that rivals server-side performance despite its tiny 0.39B parameters.
Image generation takes a leap towards real-world knowledge by training an agent that actively searches for and integrates external information, substantially boosting performance on knowledge-intensive tasks.
Zero-shot Vision-Language Models can now guide chip floorplanning, beating specialized ML methods by up to 32% without any fine-tuning.
Correcting errors early in the diffusion process matters more than fixing them later: Stepwise-Flow-GRPO leverages this insight to dramatically improve RL-based flow model training.
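The "fix errors early" insight suggests front-loading credit when distributing a trajectory-level reward across denoising steps. A toy sketch; the geometric schedule is an assumption, not the paper's GRPO weighting:

```python
import torch

def stepwise_weights(num_steps, decay=0.9):
    """Weight earlier denoising steps more heavily when spreading a
    trajectory reward over steps. Purely illustrative."""
    w = decay ** torch.arange(num_steps, dtype=torch.float32)
    return w / w.sum()

reward = 1.7                                   # scalar reward for one sample
w = stepwise_weights(num_steps=10)
per_step_credit = reward * w                   # step 0 gets the largest share
print(per_step_credit)
```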
Aggregate accuracy can be dangerously misleading when evaluating facial recognition systems for law enforcement, obscuring significant disparities in error rates across demographic subgroups.
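The fix this points to is mechanical: disaggregate. A sketch with hypothetical column names showing how a tolerable aggregate error can hide a large per-subgroup disparity:

```python
import pandas as pd

# hypothetical evaluation log: one row per verification attempt
df = pd.DataFrame({
    "subgroup": ["A", "A", "B", "B", "B", "C", "C", "C"],
    "correct":  [1,   1,   1,   0,   0,   1,   1,   0],
})

aggregate = 1 - df["correct"].mean()
per_group = 1 - df.groupby("subgroup")["correct"].mean()
print(f"aggregate error: {aggregate:.2f}")    # hides the disparity below
print(per_group)                              # A: 0.00, B: 0.67, C: 0.33
```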