Computer Vision - Weekly Roundup

Meta-CoT: Enhancing Granularity and Generalization in Image Editing

Shiyi Zhang +10·Apr 27, 2026

Decomposing image editing tasks into meta-tasks and aligning model reasoning with editing behavior unlocks surprising generalization to unseen editing operations.

Shiyi Zhang, Yiji Cheng, Tiankai Hang +8

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

All Papers (100)

Apr 27, 2026

Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

Xinxing Liu, Xinxin Liu, Ming Li +3

Computer Vision RLHF & Preference Learning Training Efficiency & Optimization

Zhongjie Duan +2·Apr 27, 2026

Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion

Finally, a plugin framework that lets you mix-and-match KV-Cache, LoRA, and other controls to steer diffusion models without being locked into a specific backbone.

Zhongjie Duan, Hong Zhang, Yingda Chen

Architecture Design (Transformers, SSMs, MoE) Computer Vision Open-Source Models & Weights

Apr 27, 2026·also ByteDance

ViPO: Visual Preference Optimization at Scale

Scaling visual preference optimization hinges on data quality, as a massive, high-resolution dataset renders complex optimization algorithms unnecessary.

Ming Li, Jie Wu, J. Cui +4

Computer Vision Multimodal Models RLHF & Preference Learning

Hao Wang +6·Apr 27, 2026

X2SAM: Any Segmentation in Images and Videos

Finally, a single model that handles any segmentation task in both images and videos, understanding both text and visual prompts.

Hao Wang, Limeng Qiao, Chi Zhang +4

Google Research·Apr 27, 2026·also LinkedIn Corporation

Co-Director: Agentic Generative Video Storytelling

Forget handcrafted prompts: a hierarchical multi-agent framework turns diffusion models into coherent storytelling engines by globally optimizing for semantic coherence.

Yale Song, Yiwen Song +29

Computer Vision Multimodal Models Tool Use & Agents

Apr 27, 2026

ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

Current VLM spatial reasoning benchmarks are misleading, as they often penalize models for "incorrect" answers that are actually correct given the limited visual information the models receive.

Yiming Zhang, Jiacheng Chen, Jiaqi Tan +3

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Boyang Wang +4·Apr 27, 2026

OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer

State-of-the-art shot boundary detection gets a major upgrade with a Transformer-based approach that not only improves accuracy but also offers more interpretable boundaries, thanks to a novel relational prediction framework and synthetic training data.

Boyang Wang, Guangyi Xu, Zhipeng Tang +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision

Zhiheng Liu +14·Apr 27, 2026

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Ditching the vision encoder actually *improves* multimodal understanding at scale, proving that pixel embeddings alone can achieve state-of-the-art results in unified multimodal models.

Zhiheng Liu, Weiming Ren, Xiaoke Huang +12

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Eva Krueger +2·Apr 27, 2026

An analysis of sensor selection for fruit picking with suction-based grippers

Knowing *when* to listen to *which* sensor lets robotic fruit pickers predict failures before they happen, boosting accuracy to 90% even with minimal sensor sets.

Eva Krueger, Marcus Rosette, Joseph R. Davidson

Generative diffusion models for spatiotemporal influenza forecasting

Apr 27, 2026·also UNC

Diffusion models, typically used for image generation, can now forecast infectious disease with accuracy rivaling traditional ensemble methods, offering a new tool for public health.

J. Lemaitre, Justin Lessler

Computer Vision Scientific Discovery & Drug Design

G. Channing +6·Apr 27, 2026

Contrastive Image-Metadata Pre-Training for Materials Transmission Electron Microscopy

Unlock the secrets hidden in your lab's backed-up microscopy data: style transfer networks can now "re-imagine" images as if they were captured with different instrument settings.

G. Channing, D. Keller, M. Rossell +4

Computer Vision Multimodal Models Scientific Discovery & Drug Design

Jun Li +9·Apr 27, 2026·also Tsinghua AI

Dynamic Decision Learning: Test-Time Evolution for Abnormality Grounding in Rare Diseases

Frozen vision-language models can dramatically improve abnormality grounding in rare disease imaging by iteratively refining decisions through optimized instructions and visual perturbations.

Jun Li, Mingxuan Liu, Jiazhen Pan +7

Computer Vision Multimodal Models Scientific Discovery & Drug Design

Yifei Wei +8·Apr 27, 2026

Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System

Decomposing robotic manipulation into coarse and fine-grained actions isn't just conceptually cleaner—it actually unlocks a sweet spot where learning difficulty is balanced, boosting performance.

Yifei Wei, Linqing Zhong, Yi Liu +6

Computer Vision Multimodal Models Robotics & Embodied AI

C. O’Brien +3·Apr 27, 2026

Evaluation of Pose Estimation Systems for Sign Language Translation

Your sign language translation model's performance could be bottlenecked by your choice of pose estimator: switching from MediaPipe to SDPose or Sapiens could boost BLEU score by 1.5 points.

C. O’Brien, Gerard Sant, Mathias Muller +1

Computer Vision Eval Frameworks & Benchmarks Natural Language Processing

Dazhuang Liu +3·Apr 27, 2026

DETOUR: A Practical Backdoor Attack against Object Detection

Object detection models are surprisingly vulnerable to practical backdoor attacks using real-world semantic triggers that work across different sizes, locations, and viewpoints.

Dazhuang Liu, Yanqi Qiao, Kaitai Liang +1

Computer Vision Red-Teaming & Adversarial Robustness

Shrisudhan Govindarajan +5·Apr 27, 2026

Power Foam: Unifying Real-Time Differentiable Ray Tracing and Rasterization

Real-time differentiable rendering just got a whole lot faster: Power Foam unifies ray tracing and rasterization, rivaling 3DGS performance without sacrificing ray tracing benefits.

Shrisudhan Govindarajan, Daniel Rebain, Dor Verbin +3

DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation

Mohamad Zamini +1·Apr 27, 2026

Achieve SOTA zero-shot segmentation by simply fusing two CLIP branches, one focusing on local token reliability and the other on structural priors, all without training.

Mohamad Zamini, Diksha Shukla

BifDet: A 3D Bifurcation Detection Dataset for Airway-Tree Modeling

Ali Keshavarzi +3·Apr 27, 2026

Finally, a dataset exists to train and benchmark algorithms for automatically detecting airway bifurcations in 3D CT scans, a crucial step towards understanding respiratory diseases.

Ali Keshavarzi, Quentin Bouniot, Benjamin M. Smith +1

Computer Vision Data Curation & Synthetic Data Scientific Discovery & Drug Design

Jongwoo Nam +2·Apr 27, 2026

ShapeY: A Principled Framework for Measuring Shape Recognition Capacity via Nearest-Neighbor Matching

Even the best vision models make shockingly bad shape recognition errors, like confusing a car with a chair, when evaluated on a new viewpoint-invariant shape recognition benchmark.

Jongwoo Nam, Amanda Rios, Bartlett W. Mel

Computer Vision Eval Frameworks & Benchmarks

F. Gustafsson +4·Apr 27, 2026

Benchmarking Pathology Foundation Models for Breast Cancer Survival Prediction

Scaling up pathology foundation models doesn't guarantee better survival prediction—a distilled model with 8% of the parameters can outperform its larger teacher.

F. Gustafsson, C. Boissin, J. Vallon-Christersson +2

Computer Vision Eval Frameworks & Benchmarks Scientific Discovery & Drug Design

Haosong Xiao +4·Apr 27, 2026

Infrastructure-Guided Connectivity-Enhanced Road Crack Detection and Estimation

Road crack detection gets a boost by having the infrastructure tell the car where to look.

Haosong Xiao, Yamini Ramesh, R. Shukla +2

Agentic AI for Remote Sensing: Technical Challenges and Research Directions

Apr 27, 2026

Agentic AI struggles with Earth Observation because reprojection, resampling, and other geospatial operations silently corrupt data, demanding a new agent design paradigm.

Muhammad Akhtar Munir, Muhammad Umer Sheikh, Akashah Shabbir +5

Computer Vision Multimodal Models Tool Use & Agents

Jorge L. A. Lima +1·Apr 27, 2026

Aycromo: An Open-Source Platform for Automatic Chromosome Detection in Metaphase Images Based on Deep Learning

Cytogeneticists can now slash chromosome analysis time from days to seconds with Aycromo, an open-source platform that democratizes access to high-performance deep learning models.

Jorge L. A. Lima, Filipe R. Cordeiro

Computer Vision Open-Source Models & Weights Scientific Discovery & Drug Design

Antoine P. Leeman +3·Apr 27, 2026

VISION-SLS: Safe Perception-Based Control from Learned Visual Representations via System Level Synthesis

Safe visuomotor control from high-resolution images is now practical at scale, thanks to a learned visual abstraction coupled with an efficient SLS solver.

Antoine P. Leeman, Shuyu Zhan, M. Zeilinger +1

VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

Maitreya Patel +4·Apr 27, 2026

Autoregressive image models can now compete with diffusion models in image quality and efficiency, thanks to a variable-length tokenization scheme that decouples compute from resolution.

Maitreya Patel, Jingtao Li, Weiming Zhuang +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Apr 27, 2026·also UofT

ESICA: A Scalable Framework for Text-Guided 3D Medical Image Segmentation

Text-guided 3D medical image segmentation just got a whole lot more practical: ESICA achieves state-of-the-art accuracy with a "Lite" variant that slashes parameter count without sacrificing performance.

Yuelin Xin, Gorkem Can Ates, Jun Ma +4

Computer Vision Multimodal Models Natural Language Processing

Nikesh Subedi +2·Apr 27, 2026

Interactive Episodic Memory with User Feedback

Interactive feedback slashes error rates in episodic memory retrieval, outperforming even large vision-language models while remaining efficient.

Nikesh Subedi, Loris Bazzani, Ziad Al-Halah

Computer Vision Multimodal Models Natural Language Processing

Weijie Wang +9·Apr 27, 2026·also Microsoft Research

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Text-to-video models can now learn geometrically consistent world dynamics via reinforcement learning, without expensive architectural changes.

Weijie Wang, Youping Gu, Zeyu Zhang +7

Computer Vision Multimodal Models World Models & Planning

Vandita Shukla +3·Apr 27, 2026

WildLIFT: Lifting monocular drone video to 3D for species-agnostic wildlife monitoring

Unlock species-agnostic 3D tracking from standard drone footage with WildLIFT, turning 2D video into structured, viewpoint-aware representations for richer wildlife analysis.

Vandita Shukla, Fabio Remondino, Blair R. Costelloe +1

Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

Guangdong University of Technology·Apr 27, 2026·also Kuaishou, PKU, SYSU

Test-time adaptation of vision-language models can actually *hurt* performance when modalities shift asymmetrically; MG-MTTA fixes this by explicitly modeling modality reliability.

Lixian Chen, Mingxuan Huang, Yan-Hong Chen +2

Computer Vision Multimodal Models Natural Language Processing

Haoxiao Wang +10·Apr 27, 2026

Diffusion Model as a Generalist Segmentation Learner

Turns out, your image-generating diffusion model already knows how to segment anything you ask it to.

Haoxiao Wang, Antao Xiang, Haiyang Sun +8

Shared-kernel Wavelet Neural Networks for Poisson Image Reconstruction

Yuanhao Gong +2·Apr 27, 2026

Achieve real-time, accurate image reconstruction from sparse Laplacian fields using a wavelet neural network with only 200 parameters.

Yuanhao Gong, Tang Tang, Qianyan Liu

Architecture Design (Transformers, SSMs, MoE) Computer Vision

Hamed Rahimi +4·Apr 27, 2026

IntentVLM: Open-Vocabulary Intention Recognition through Forward-Inverse Modeling with Video-Language Models

Robots can now understand human intentions with near-human accuracy thanks to a new video-language model that reasons about goals like a human.

Hamed Rahimi, Clémence Grislain, Adrien Jacquet Cretides +2

Computer Vision Multimodal Models Robotics & Embodied AI

Shaunak Kolhe +13·Apr 27, 2026

Pushing Radar Odometry Beyond the Pavement: Current Capabilities and Challenges

Radar odometry, typically confined to urban settings, can be pushed off-road with simple adaptations like IMU preintegration, but still faces significant challenges in unstructured environments.

Shaunak Kolhe, Peng Jiang +11

ARETE: Attention-based Rasterized Encoding for Topology Estimation using HSV-transformed Crowdsourced Vehicle Fleet Data

Daniel Fritz +4·Apr 27, 2026

Encoding vehicle trajectory directionality via HSV rasterization unlocks accurate lane-level HD map generation from crowdsourced data using a DETR architecture.

Daniel Fritz, Dimitrios Lagamtzis, M. Mink +2

TEACar: An Open-Source Autonomous Driving Platform

Zhongzheng Zhang +7·Apr 27, 2026

An open-source autonomous driving platform offers researchers a modular, scalable, and cost-effective alternative to complex and restrictive hardware validation setups.

Zhongzheng Zhang, Maxwell Ruyle, A. Kappes +5

Computer Vision Open-Source Models & Weights Robotics & Embodied AI

University of Guilan·Apr 27, 2026

Passage-Aware Structural Mapping for RGB-D Visual SLAM

Robots can now "see" and understand doorways, enabling more robust navigation in complex indoor environments.

Ali Tourani, Miguel Fernández-Cortizas, Saad Ejaz +4

Real-time windrow detection from onboard tractor sensors for automated following

Lorenz Gunreben +4·Apr 27, 2026

Low-cost stereo vision can rival LiDAR for real-time windrow detection, paving the way for more accessible autonomous farming solutions.

Lorenz Gunreben, Nico Heider, Sebastian Zürner +2

Computer Vision Data Curation & Synthetic Data Robotics & Embodied AI

W. Z. E. Amri +1·Apr 27, 2026·also Leibniz Universität Hannover

SPLIT: Separating Physical-Contact via Latent Arithmetic in Image-Based Tactile Sensors

Simulate once, deploy anywhere: SPLIT lets you train tactile perception models on synthetic data and transfer them across different sensors without retraining.

W. Z. E. Amri, Nicolás Navarro-Guerrero

Computer Vision Data Curation & Synthetic Data Robotics & Embodied AI

NVIDIA·Apr 27, 2026

MotionBricks: Scalable Real-Time Motions with Modular Latent Generative Model and Smart Primitives

Forget clunky animation pipelines – MotionBricks lets you assemble real-time, high-quality character motions like LEGOs, even controlling robots.

Tingwu Wang, Olivier Dionne, Mick Ruyter +13

Architecture Design (Transformers, SSMs, MoE) Computer Vision Robotics & Embodied AI

Sheng Zhong +9·Apr 27, 2026

Event-based SLAM Benchmark for High-Speed Maneuvers

Current event-based SLAM algorithms falter when faced with the full complexity of high-speed, 6-DoF maneuvers, highlighting a gap between current capabilities and the promise of event cameras.

Sheng Zhong, Junkai Niu, Guillermo Gallego +7

Computer Vision Eval Frameworks & Benchmarks Robotics & Embodied AI

Zirui Chen +2·Apr 27, 2026

Guiding Vector Field Generation via Score-based Diffusion Model

Score-based diffusion models can now generate robust guiding vector fields for robotic path following, even when traditional methods stumble on unordered, branching, or probabilistically-generated paths.

Zirui Chen, Shiliang Guo, Shiyu Zhao

$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

Siyao Xiao +11·Apr 27, 2026

Forget end-to-end fine-tuning: $M^2$-VLA unlocks the power of generalized VLMs for robotic manipulation by intelligently mixing layers and incorporating meta-skills.

Siyao Xiao, Yuhong Zhang, Zhifang Liu +9

Computer Vision Multimodal Models Robotics & Embodied AI

Xi Shen +3·Apr 27, 2026

Opto-Atomic Spatio-Temporal Holographic Correlators for High-Speed 3D CNNs

Ditch silicon bottlenecks: a novel optoelectronic correlator uses cold atoms to accelerate 3D CNNs by orders of magnitude.

Xi Shen, Bowen Qi, Tabassom Hamidfar +1

Architecture Design (Transformers, SSMs, MoE) Computer Vision Distributed Systems & Hardware

Anthony Faure-Gignoux +3·Apr 27, 2026

Compilation and Execution of an Embeddable YOLO-NAS on the VTA

Compiling and executing YOLO-NAS on an FPGA-based accelerator is now possible, opening doors for real-time object detection in safety-critical applications like aeronautics.

Anthony Faure-Gignoux, Kevin Delmas, Adrien Gauffriau +1

Computer Vision Distributed Systems & Hardware Inference & Quantization

T. Grossman +3·Apr 27, 2026

DiffuSAM: Diffusion-Based Prompt-Free SAM2 for Few-Shot and Source-Free Medical Image Segmentation

Ditch the prompts: DiffuSAM adapts SAM2 for medical image segmentation by synthesizing mask embeddings with a diffusion model, achieving strong performance without fine-tuning or expert input.

T. Grossman, N. Cahan, Lev Ayzenberg +1

Computer Vision Scientific Discovery & Drug Design Training Efficiency & Optimization

Esteban Rodríguez-Betancourt +1·Apr 27, 2026

Geometric Analysis of Self-Supervised Vision Representations for Semantic Image Retrieval

Self-supervised vision models that ace linear probing can still flop at semantic image retrieval because of skewed latent space geometry that breaks approximate nearest neighbor search.

Esteban Rodríguez-Betancourt, Edgar Casasola-Murillo

Computer Vision Multimodal Models Recommendation & Information Retrieval

MIT CSAIL·Apr 27, 2026·also AI for Responsible, Beth Israel Deaconess Medical Center, Bordeaux Population Health Research Center, Clinical Research Center +8

Quantum Kernel Advantage over Classical Collapse in Medical Foundation Model Embeddings

Quantum kernels unlock signal in medical image embeddings where classical methods fail, suggesting a new path for extracting value from medical foundation models.

Sebastian Cajas Ordóñez, Felipe Ocampo Osorio +16

Computer Vision Multimodal Models Scientific Discovery & Drug Design

Zhengru Fang +7·Apr 27, 2026·also HKU, Hong Kong JC STEM Lab of Smart City

Agent-Centric Visual Reinforcement Learning under Dynamic Perturbations

Visual RL agents can recover near-perfect performance even under severe, dynamically changing visual corruptions by learning to disentangle task-relevant foreground from perturbation artifacts.

Zhengru Fang, Yu Guo, Fei Liu +5

Computer Vision Red-Teaming & Adversarial Robustness Robotics & Embodied AI

Apr 27, 2026·also IIT Delhi, Indraprastha Institute of Information, Jaypee Institute of Information

Learning Illumination Control in Diffusion Models

Open-source diffusion models can now achieve state-of-the-art illumination control rivaling closed-source alternatives, thanks to a novel training pipeline and dataset.

Nishit Anand, Manan Suri, Christopher Metzler +2

Computer Vision Data Curation & Synthetic Data Open-Source Models & Weights

Y. Baba +1·Apr 27, 2026

Point-MF: One-step Point Cloud Generation from a Single Image via Mean Flows

Achieve millisecond-level 3D point cloud reconstruction from a single image without sacrificing quality, blowing past diffusion model latency.

Y. Baba, Keiji Yanai

Architecture Design (Transformers, SSMs, MoE) Computer Vision Training Efficiency & Optimization

Hai Wang +3·Apr 27, 2026

Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics

CLIP models, despite their prowess, stumble when understanding 360° images, failing to maintain semantic alignment under horizontal circular shifts.

Hai Wang, Xiaocheng Yang, Mingzhi Dong +1

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Weixing Wang +7·Apr 27, 2026

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

Unified multimodal models can ace visual understanding and generation tasks, yet still fail to maintain basic semantic consistency between them.

Weixing Wang, Liudvikas Zekas, Anton Hackl +5

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Cheng-Han Lee +5·Apr 27, 2026·also Meituan, Northeastern

Subjective Portrait Region Cropping in Landscape Videos with Temporal Annotation Smoothing

A new large-scale dataset of human-annotated video crops enables training models that adapt videos to different aspect ratios while preserving visual quality and meaning.

Cheng-Han Lee, Maniratnam Mandal, N. Birkbeck +3

GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction

Apr 27, 2026·also New Laboratory of Pattern Recognition

You don't need billions of parameters to accurately ground GUI elements: GoClick, a 230M parameter model, matches the performance of much larger models, opening the door for on-device GUI agents.

Hongxin Li, Yuntao Chen +3

Computer Vision Multimodal Models Tool Use & Agents

Apr 26, 2026

Apr 26, 2026·also Cornell, Technion

Prox-E: Fine-Grained 3D Shape Editing via Primitive-Based Abstractions

Achieve surgical 3D edits without training: Prox-E lets you reshape objects with language by manipulating a compact set of geometric primitives.

Etai Sella, Hao Phung, Nitay Amiel +3

Computer Vision Multimodal Models Natural Language Processing

Zhen Ye +10·Apr 26, 2026

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Disentangling high-level cross-modal reasoning from low-level modality-specific refinement in talking head generation yields superior lip-sync accuracy, video quality, and audio quality compared to entangled approaches.

Zhen Ye, Xu Tan, Aoxiong Yin +8

Computer Vision Multimodal Models Speech & Audio

Pritesh Jha·Apr 26, 2026

RaV-IDP: A Reconstruction-as-Validation Framework for Faithful Intelligent Document Processing

By reconstructing extractions and comparing them to the original document, RaV-IDP offers a grounded, label-free quality signal that dramatically improves the fidelity of intelligent document processing pipelines.

Pritesh Jha

Computer Vision Natural Language Processing Recommendation & Information Retrieval

Simone Mosco +2·Apr 26, 2026

Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation

Current 3D anomaly detection struggles with real-world complexity, but this new approach directly models inlier feature distributions, achieving state-of-the-art results and offering a more robust solution.

Simone Mosco, Daniel Fusaro, Alberto Pretto

FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing

Apr 24, 2026

Ze Chen +3·Apr 24, 2026·also Communication University of China

FlowAnchor makes flow-based video editing robust to multi-object scenes and long sequences by stabilizing the editing signal, opening the door to more complex and controllable video manipulation.

Ze Chen, Lan Chen, Yuanhang Li +1

Computer Vision Training Efficiency & Optimization

Chengye Wang +3·Apr 24, 2026

TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

Existing document OCR models fail to preserve crucial structural and executable properties of LaTeX, but TexOCR, trained with verifiable rewards, finally delivers compilable page-to-LaTeX reconstruction.

Chengye Wang, Ling Fu, Zexi Kuang +1

Code Generation & Program Synthesis Computer Vision Eval Frameworks & Benchmarks

Apr 24, 2026·also University of Moratuwa

VC-FES: viewpoint-conditioned feature selection for vehicle re-identification in thermal vision

Adapting RGB-pretrained ViTs with viewpoint-conditioned feature selection outperforms existing thermal vehicle re-identification methods by a significant margin.

Yasod Ginige, R. Gunasekara, D.C. Hewavitharana +3

PHOTON: Non-Invasive Optical Tracking of Key-Lever Motion in Historical Keyboard Instruments

Apr 23, 2026

Noah Jaffe +1·Apr 23, 2026

Unlock the secrets of historical keyboard performance with PHOTON, a non-invasive optical tracking system that reveals the subtle interplay between performer input and instrument mechanics.

Noah Jaffe, J. Burgoyne

Soft Anisotropic Diagrams for Differentiable Image Representation

Independent Researcher·Apr 23, 2026

SAD offers a surprisingly fast and accurate alternative to neural implicit representations for image compression and differentiable rendering, achieving 4-19x training speedups while outperforming state-of-the-art methods like Image-GS.

Laki Iinbor, Zhi-Chao Dou, Wojciech Matusik

Architecture Design (Transformers, SSMs, MoE) Computer Vision

Max Defez +4·Apr 23, 2026

A Scale-Adaptive Framework for Joint Spatiotemporal Super-Resolution with Diffusion Models

Unlock reusable architectures for climate data super-resolution: a single diffusion model now handles spatial upscaling from 1x to 25x and temporal upscaling from 1x to 6x.

Max Defez, F. Quarenghi, M. Vrac +2

Computer Vision Scientific Discovery & Drug Design

Timothy Murphy +2·Apr 23, 2026

Interpretable facial dynamics as behavioral and perceptual traces of deepfakes

Deepfakes betray themselves through subtle irregularities in the timing of facial movements, especially when expressing emotions, offering a new avenue for detection.

Timothy Murphy, J. Cook, H. Cuve

Computer Vision Interpretability & Mechanistic Interp

Wenxuan Bao +2·Apr 23, 2026

Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

Ramen achieves robust test-time adaptation of VLMs in mixed-domain scenarios by selecting the right samples to adapt to, sidestepping the common pitfall of performance degradation when faced with diverse and inconsistent test data.

Wenxuan Bao, Yanjun Zhao, Xiyuan Yang

Fixation Sequences as Time Series: A Topological Approach to Dyslexia Detection

M. Huber +2·Apr 23, 2026

Persistent homology, when applied to eye-tracking data via novel filtration techniques, unlocks dyslexia detection performance exceeding traditional statistical methods.

M. Huber, D. Reich, Lena A. Jager

Computer Vision Natural Language Processing Scientific Discovery & Drug Design

Rishona Daniels +4·Apr 23, 2026

On the Role of Preprocessing and Memristor Dynamics in Reservoir Computing for Image Classification

Volatile memristors can achieve state-of-the-art image classification accuracy in reservoir computing, even with significant device variability, suggesting they are a viable alternative to traditional CMOS.

Rishona Daniels, Duna Wattad, Ronny Ronen +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision Distributed Systems & Hardware

Yixuan Zhu +8·Apr 23, 2026

VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution

VARestorer distills a text-to-image VAR model into a one-step super-resolution network, achieving state-of-the-art image quality with a 10x speedup.

Yixuan Zhu, Shilin Ma, Haolin Wang +6

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

Eghbal A. Hosseini +3·Apr 23, 2026

Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion

Stimuli that vision models agree on most strongly drive alignment with language models, doubling cross-modal convergence.

Eghbal A. Hosseini, Brian Cheung, E. Fedorenko +1

Trust-SSL: Additive-Residual Selective Invariance for Robust Aerial Self-Supervised Learning

Prince Sultan University·Apr 23, 2026

Stop punishing your model for disagreeing with corrupted data – Trust-SSL learns better representations by treating alignment with degraded views as a residual learning problem, not a hard constraint.

Wadii Boulila, A. Ammar, Bilel Benjdira +1

Computer Vision Data Curation & Synthetic Data Training Efficiency & Optimization

Apr 23, 2026·also JD.com, Tencent AI

Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

Learnable critics that evaluate the model's own GUI grounding proposals, rather than relying on static geometric heuristics, unlock substantial gains in accuracy.

Wenkai Wang, Xiyun Li, Hongcan Guo +5

Computer Vision Multimodal Models Tool Use & Agents

Sagar Dubey +1·Apr 23, 2026

The Feedback Hamiltonian is the Score Function: A Diffusion-Model Framework for Quantum Trajectory Reversal

Quantum trajectory reversal, previously understood through specific feedback protocols, is now shown to be fundamentally linked to score-based diffusion, opening the door to ML-driven control in noisy, real-world quantum systems.

Sagar Dubey, A. John

Computer Vision Scientific Discovery & Drug Design

Boxun Xu +9·Apr 23, 2026

Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

Autoregressive video diffusion models can achieve faster decoding, lower memory footprint, and higher quality long-horizon generations by learning to attend to only the most salient spatiotemporal blocks.

Boxun Xu, Yuming Du, Zichang Liu +7

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

Eleanor P. Wiesler +1·Apr 23, 2026

Graph Neural Network-Informed Predictive Flows for Faster Ford-Fulkerson and PAC-Learnability

Forget repeatedly re-running inference on residual graphs: this GNN-guided Ford-Fulkerson algorithm learns edge importance probabilities to dramatically accelerate max-flow computation and image segmentation.

Eleanor P. Wiesler, Trace Baxley

Architecture Design (Transformers, SSMs, MoE) Computer Vision Training Efficiency & Optimization

Umar Masud +4·Apr 23, 2026

Addressing Image Authenticity When Cameras Use Generative AI

Your camera's AI could be subtly rewriting reality, but this method lets you reverse the changes and see the "unhallucinated" original.

Umar Masud, Abhijith Punnappurath, Luxi Zhao +2

Computer Vision Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Safouane El Ghazouali +3·Apr 23, 2026

SyMTRS: Benchmark Multi-Task Synthetic Dataset for Depth, Domain Adaptation and Super-Resolution in Aerial Imagery

A new synthetic aerial imagery dataset provides pixel-perfect depth, controlled illumination, and multi-scale imagery, unlocking joint research across geometric understanding, domain robustness, and resolution enhancement.

Safouane El Ghazouali, Nicola Venturi, Michael Rueegsegger +1

Computer Vision Data Curation & Synthetic Data Eval Frameworks & Benchmarks

Bowen Liu +8·Apr 23, 2026

Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

Mimicking how clinicians review capsule endoscopy videos—first screening, then weaving context, and finally converging evidence—yields surprisingly effective summarization of these ultra-long videos.

Bowen Liu, Li Yang, Shanshan Song +6

Computer Vision Natural Language Processing

Dat To-Thanh +9·Apr 23, 2026

Bridging the Training-Deployment Gap: Gated Encoding and Multi-Scale Refinement for Efficient Quantization-Aware Image Enhancement

Achieve high-fidelity image enhancement on mobile devices even after quantization by training a model that anticipates and adapts to low-precision representations.

Dat To-Thanh, N. Nguyen-Trong +7

Computer Vision Inference & Quantization Training Efficiency & Optimization

Zhen Zhang +6·Apr 23, 2026

Causal Disentanglement for Full-Reference Image Quality Assessment

Achieve state-of-the-art image quality assessment by causally disentangling content and degradation, even in data-scarce domains where existing methods fail.

Zhen Zhang, Jielei Chu, Tian Zhang +4

K. Fojcik Apr 23, 2026

Efficient Logic Gate Networks for Video Copy Detection

Achieve competitive video copy detection accuracy with descriptors orders of magnitude smaller and inference speeds exceeding 11k samples per second by replacing floating-point operations with a learned Boolean circuit.

K. Fojcik

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

Andrew Shin Apr 23, 2026

AI-Gram: When Visual Agents Interact in a Social Network

LLM-driven visual agents form complex communication structures, but stubbornly resist stylistic convergence, revealing a fundamental tension between social expression and individual identity.

Andrew Shin

Computer Vision Multimodal Models Tool Use & Agents

B. Lim +3 Apr 23, 2026

VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought

Forget hand-annotated visual reasoning datasets: VG-CoT leverages a fully automated pipeline to generate grounded, step-by-step reasoning, enabling scalable and cost-efficient training of more trustworthy LVLMs.

B. Lim, Kyeonghyun Kim, Jung-Shin Yun +1

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Hao-Yu Hsu +4 Apr 23, 2026·also UIUC

Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

Imagine reconstructing detailed human motion and scene layouts using just your smartwatch and earbuds – no cameras needed.

Hao-Yu Hsu, Tianhang Cheng, Jing Wen +2

Computer Vision Multimodal Models Robotics & Embodied AI

Minghao Yin +4 Apr 23, 2026

Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers

Forget generating static shapes – Sculpt4D now lets you efficiently sculpt dynamic 4D objects with state-of-the-art temporal coherence.

Minghao Yin, Wenbo Hu, Jiale Xu +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision Training Efficiency & Optimization

Apr 23, 2026

Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

Training a video reshooting model on internet-scale monocular videos is now possible, thanks to a clever self-supervision trick that generates multi-view training data from a single video.

Avinash Paliwal, Adithya Iyer, Shivin Yadav +2

Computer Vision Data Curation & Synthetic Data Training Efficiency & Optimization

Ying Yang +2 Apr 23, 2026

Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation

Domain shifts and novel classes at test time can be tamed by nudging features back towards the source distribution, even for out-of-distribution examples.

Ying Yang, Chaoqi Chen, Hui Huang

Computer Vision Training Efficiency & Optimization

L. Çağlar +2 Apr 23, 2026

Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision

Despite achieving comparable accuracy, humans and deep vision models exhibit fundamentally different error patterns, revealing distinct inductive biases that can be quantified through directional confusion analysis and Rate-Distortion geometry.

L. Çağlar, Pedro Mediano, Baihan Lin

Computer Vision Interpretability & Mechanistic Interp

Katharina Prasse +6 Apr 23, 2026

From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

VLMs can reliably reveal population-level trends in climate change discourse on social media, even when per-image accuracy is only moderate.

Katharina Prasse, Steffen Jung, Isaac Bravo +4

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Apr 23, 2026

Grounding Video Reasoning in Physical Signals

Current video Q&A benchmarks can be fooled by textual regularities, failing to actually ground reasoning in the video's physical reality.

Alibay Osmanli, Zixu Cheng, Shaogang Gong

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Dan Fu +3 Apr 23, 2026

Multiscale Super Resolution without Image Priors

Super-resolution is possible without image priors by cleverly combining low-resolution images at different scales, unlocking a stable inverse system for reconstruction.

Dan Fu, Gabby Litterio, Pedro Felzenszwalb +1

Zixu Li +5 Apr 23, 2026

TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

Multi-modification image retrieval is now possible: TEMA handles complex, real-world instructions that go beyond simple changes, outperforming existing methods on new datasets M-FashionIQ and M-CIRR.

Zixu Li, Yupeng Hu, Zhiheng Fu +3

Computer Vision Multimodal Models Recommendation & Information Retrieval

Guangkai Xu +6 Apr 23, 2026

Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

Seemingly innocuous choices in loss functions and training regimes can significantly hinder visual geometry estimation, even for state-of-the-art methods.

Guangkai Xu, Huakang Geng, Huan Zheng +4

Computer Vision Data Curation & Synthetic Data

Apr 23, 2026·also PKU

DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures

Turn your 3D Gaussian Splatting failures into features: DualSplat uses initial reconstruction artifacts to bootstrap robust scene representations in the presence of transient objects.

Xu Wang, Zhiru Wang, Shiyun Xie +2