Computer Vision - Weekly Roundup

Scalable inference of spatial regions and temporal signatures from time series

2w ago

Discovering spatial regions and their temporal signatures in massive time series data just got much faster and easier, thanks to a new method that scales log-linearly with the number of time series.

Jiayu Weng, Alec Kirkley

Computer Vision Natural Language Processing Scientific Discovery & Drug Design

Univ. Grenoble Alpes·2w ago

Full-chip CMP modelling based on Fully Convolutional Network leveraging White Light Interferometry

Nanometer-accurate, full-chip CMP modeling is now possible with a fast, FCN-based approach that leapfrogs traditional, resource-intensive methods.

Jules Exbrayat, Renan Bouis, Elie Sezestre +4

Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation

2w ago·also Huawei

Decoupling radial and angular dynamics in vision-language model adaptation unlocks significant gains in few-shot performance, outperforming existing flow matching methods.

Hongxu Chen, Yanghao Wang, Bowei Zhu +6

Computer Vision Multimodal Models Training Efficiency & Optimization

2w ago

FairEnc: A Fair Vision-Language Model with Fair Vision and Text Encoders for Glaucoma Detection

Training vision-language models to detect glaucoma fairly across demographics requires debiasing both text *and* images, which this paper achieves with a novel pretraining strategy.

Mohamed Elhabebe, Ayman El-Baz

Computer Vision Constitutional AI & AI Ethics Multimodal Models

University of Science and Technology·2w ago

Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

MLLMs can overcome self-referential bias and improve visual grounding by actively exploring and correcting their cognitive deficiencies, guided by token-level epistemic uncertainty.

Huatian Zhang, Zhendong Mao, Lei Zhang +1

Computer Vision Multimodal Models RLHF & Preference Learning

Yifan F. Zhang +4·2w ago

Concurrence of Symmetry Breaking and Nonlocality Phase Transitions in Diffusion Models

Diffusion models' reliance on global information isn't just a quirk – it's fundamentally linked to the moment they commit to a specific semantic outcome.

Yifan F. Zhang, Fangjun Hu, Guangkuo Liu +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision

Berk Sezer +3·2w ago

Gaze4HRI: Zero-shot Benchmarking Gaze Estimation Neural-Networks for Human-Robot Interaction

Turns out, all gaze estimation models stumble when robots look down, and complex architectures aren't the answer – data diversity is the real secret to robust human-robot interaction.

Berk Sezer, Ali Gorkem Küçük, Erol Şahin +1

Computer Vision Eval Frameworks & Benchmarks Robotics & Embodied AI

Yurui Du +3·2w ago

ELVIS: Ensemble-Calibrated Latent Imagination for Long-Horizon Visual MPC

Achieve robust long-horizon visual control by adaptively balancing model-based lookahead with bootstrapping, enabling zero-shot transfer to real-world tasks with severe occlusions.

Yurui Du, Pinhao Song, Yutong Hu +1

Computer Vision Robotics & Embodied AI World Models & Planning

NUS·2w ago

Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation

By embedding whole-slide images in a hybrid hyperbolic-Euclidean space, BatMIL unlocks superior classification performance compared to traditional Euclidean-only methods, revealing the importance of geometric awareness in capturing complex tissue organization.

Architecture Design (Transformers, SSMs, MoE) Computer Vision Scientific Discovery & Drug Design

2w ago·also HKU

Aes3D: Aesthetic Assessment in 3D Gaussian Splatting

Finally, a way to judge the *vibes* of your 3D Gaussian Splatting scenes, without needing to render a bunch of images.

Chuanzhi Xu, Boyu Wei, Haoxian Zhou +5

Computer Vision Eval Frameworks & Benchmarks

University of Nebraska-Lincoln·2w ago·also Ohio State

Look Once, Beam Twice: Camera-Primed Real-Time Double-Directional mmWave Beam Management for Vehicular Connectivity

End-to-end ML models get smoked in real-world mmWave vehicular connectivity: a hybrid vision-primed approach slashes outage rates by leveraging model-based reasoning and RF feedback.

Avhishek Biswas, Apala Pramanik, Eylem Ekici +1

Computer Vision Multimodal Models Robotics & Embodied AI

Warsaw University of Technology·2w ago·also Harvard, Massachusetts General Hospital, Warsaw

Local Intrinsic Dimension Unveils Hallucinations in Diffusion Models

Hallucinations in diffusion models aren't just mode interpolation gone wrong, but instabilities on the model's manifold, and squashing its local intrinsic dimension can fix them.

Bartlomiej Sobieski, Matthew Tivnan, Dawid Płudowski +4

Architecture Design (Transformers, SSMs, MoE) Computer Vision

Anju Rani +2·2w ago

DART: A Vision-Language Foundation Model for Comprehensive Rope Condition Monitoring

A single vision-language foundation model, DART, can perform a full rope inspection workflow, including damage classification, severity estimation, and few-shot recognition, all without task-specific fine-tuning.

Anju Rani, Daniel Ortiz-Arroyo, Petar Durdevic

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Michael Soprano +2·2w ago

Beyond Seeing Is Believing: On Crowdsourced Detection of Audiovisual Deepfakes

Human crowdsourcing struggles to reliably identify audiovisual deepfakes, especially when both audio and video are manipulated, suggesting current detection methods may overestimate human capabilities.

Michael Soprano, A. Cioci, Stefano Mizzaro

Computer Vision Constitutional AI & AI Ethics Speech & Audio

E. Denteh +3·2w ago

Hybrid Congestion Classification Framework Using Flow-Guided Attention and Empirical Mode Decomposition

Achieve near-perfect traffic congestion classification by fusing motion-guided visual attention with data-adaptive temporal decomposition, outperforming existing vision-based and signal-based methods.

E. Denteh, Blessing Agyei Kyem, Joshua Kofi Asamoah +1

FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation

Yuanzhi Wang +9·2w ago

Identity-preserving video generation just got a whole lot more faithful: FaithfulFaces maintains identity even under extreme pose variations and occlusions, a feat previous methods struggled with.

Yuanzhi Wang, Xuhua Ren, Jiaxiang Cheng +7

Computer Vision Multimodal Models Natural Language Processing

Jingtao Liu +4·2w ago

Multi-Level Bidirectional Biomimetic Learning for EEG-Based Visual Decoding

Achieve 80.5% Top-1 accuracy in zero-shot EEG-to-image retrieval by mimicking the human visual system, substantially outperforming existing methods.

Jingtao Liu, Peiliang Gong, Chuhang Zheng +2

Reference-based Category Discovery: Unsupervised Object Detection with Category Awareness

Yichen Li +2·2w ago

Unsupervised object detection can now achieve category awareness, bridging the gap with supervised methods without needing any labeled data.

Yichen Li, Qiankun Liu, Ying Fu

Computer Vision Data Curation & Synthetic Data

2w ago·also ByteDance, SEU

From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation

Ditching diffusion's noise-denoising, RLFSeg uses Rectified Flow to directly predict segmentation masks from text prompts, unlocking zero-shot performance gains.

Zishen Qu, Xuesong Li, Haijian Gu +4

Computer Vision Multimodal Models Natural Language Processing

Vlad Vasilescu +2·2w ago

Efficient Geometry-Controlled High-Resolution Satellite Image Synthesis

Synthesizing high-resolution satellite imagery with geometric precision is now more efficient, thanks to a windowed cross-attention method that rivals existing techniques while better respecting geometric constraints.

Vlad Vasilescu, Daniela Faur, T. Costachioiu

Computer Vision Data Curation & Synthetic Data Training Efficiency & Optimization

Binh Long Nguyen +4·2w ago

Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting

Unlock zero-shot 3D scene understanding: Ilov3Splat lets you identify and segment arbitrary objects in 3D scenes using only natural language, no category supervision needed.

Binh Long Nguyen, Kien Nguyen, S. Sridharan +2

Computer Vision Multimodal Models Robotics & Embodied AI

Jingtao Zhou +3·2w ago

SpecPL: Disentangling Spectral Granularity for Prompt Learning

Freezing your VAE and permuting high-frequency visual signals unlocks a new SOTA for VLM prompt learning, boosting harmonic-mean accuracy to 81.51%.

Jingtao Zhou, Xirui Kang, Feiyang Huang +1

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Yuancheng Wei +9·2w ago

DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning

Current image difference captioning benchmarks fail to capture semantic consistency and penalize hallucinations, but DiffCap-Bench offers a robust alternative that aligns with human expert judgments and predicts downstream utility for image editing.

Yuancheng Wei, Haojie Zhang, Linli Yao +7

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

2w ago

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

VLMs can be easily tricked into "hallucinating" object relationships with simple image rotations or noise, revealing a surprising fragility in their multimodal reasoning.

Philip Wootaek Shin, Ajay Narayanan Sridhar, Sivani Devarapalli +3

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

Wei Liu +2·2w ago

Open-Source Image Editing Models Are Zero-Shot Vision Learners

Open-source image editing models can match or beat fine-tuned models on visual understanding tasks *without any task-specific training*.

Wei Liu, Jiaxin Lin, Rui Chen

Computer Vision Multimodal Models Open-Source Models & Weights

2w ago

FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

Forget backprop and memory lookups: FAAST lets you adapt models at test time with a single forward pass, matching fine-tuning accuracy with massive speed and memory gains.

Guangsheng Bao, Hongbo Zhang, Han Cui +2

Computer Vision Inference & Quantization Training Efficiency & Optimization

Jiangnan Zhu +3·2w ago

Vol-Mark: A Watermark for 3D Medical Volume Data Via Cubic Difference Expansion and Contrastive Learning

Vol-Mark offers a way to protect sensitive 3D medical data from tampering and unauthorized copying with a reversible watermarking technique that maintains diagnostic accuracy.

Jiangnan Zhu, Yuntao Wang, Shengli Pan +1

Prompt-Anchored Vision-Text Distillation for Lifelong Person Re-identification

2w ago·also HIT, PKU

Freezing a text encoder and distilling prompts from vision-language models can stabilize semantics and boost performance in lifelong person re-identification, even across unseen domains.

Wen Wen, Hao Chen, Shiliang Zhang

Computer Vision Inference & Quantization Multimodal Models

2w ago·also NAVER Labs, NTU

Syn4D: A Multiview Synthetic 4D Dataset

Training on Syn4D could unlock breakthroughs in dynamic scene understanding, where current datasets fall short in providing dense, complete, and accurate geometric annotations.

Zeren Jiang, Yushi Lan, Yihang Luo +8

Computer Vision Data Curation & Synthetic Data

Friedrich-Alexander University·2w ago·also Helmholtz, Imperial, Technical University Munich

Wasserstein-Aligned Localisation for VLM-Based Distributional OOD Detection in Medical Imaging

Counterintuitively, moderately similar reference images are the key to unlocking accurate VLM-based anomaly localization in medical imaging.

Bernhard Kainz, Johanna P Mueller, Matthew Baugh +1

Computer Vision Multimodal Models Scientific Discovery & Drug Design

2w ago

CPCANet: Deep Unfolding Common Principal Component Analysis for Domain Generalization

Forget dataset-specific hacks: CPCANet achieves SOTA domain generalization by explicitly learning a structured, domain-invariant subspace with a differentiable CPCA layer.

Yu-Hsi Chen, Abd-Krim Seghouane

Architecture Design (Transformers, SSMs, MoE) Computer Vision Training Efficiency & Optimization

2w ago

Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy

By intelligently incorporating LiDAR-derived height information, HiPR overcomes limitations of fixed projection spaces, achieving state-of-the-art camera-LiDAR occupancy prediction with real-time performance.

Yuan Wu, Zhiqiang Yan, Jiawei Lian +2

Computer Vision Multimodal Models Robotics & Embodied AI

CARIAD SE·2w ago·also TU Berlin, Vision & Robotics GmbH

CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography

Finally, a driving dataset that doesn't just assume perfectly paved roads, offering 6.5x more depth data than KITTI for realistic autonomous driving scenarios.

Gasser Elazab, Frank Neuhaus, Tilman Koß +5

Computer Vision Multimodal Models Robotics & Embodied AI

Islamic University of Technology·2w ago·also University of Louisiana at Lafayette

Few-Shot Learning Pipeline for Monkeypox Skin Disease Classification Using CNN Feature Extractors

Even with limited data, a simple combination of pre-trained CNN features and nearest-centroid classification can achieve surprisingly strong results in monkeypox skin disease classification.

Md. Safirur Rashid, Sabbir Ahmed, Muhammad Usama Islam +2

Computer Vision Data Curation & Synthetic Data Scientific Discovery & Drug Design
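
To make the "pre-trained CNN features plus nearest-centroid classification" recipe in the teaser above concrete, here is a minimal sketch. It is not the paper's pipeline: the 2-D vectors are made-up stand-ins for CNN embeddings, the class names are illustrative, and `fit_centroids` and `predict` are hypothetical helper names.

```python
from math import dist

def fit_centroids(features, labels):
    """Average the feature vectors of each class into one centroid per class."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))

# Toy 2-D "embeddings" standing in for pre-trained CNN features (made-up values).
support_features = [[0.9, 0.1], [1.1, -0.1], [0.0, 1.0], [0.2, 0.8]]
support_labels = ["monkeypox", "monkeypox", "other", "other"]

centroids = fit_centroids(support_features, support_labels)
prediction = predict(centroids, [0.8, 0.2])
print(prediction)
```

In a few-shot setting the support set would hold only a handful of labeled images per class, and nothing is trained beyond computing these per-class averages.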

Honghu Pan +4·2w ago

Computer-Aided Design Generation by Cascaded Discrete Diffusion Model

Discrete diffusion, with carefully designed transition matrices for commands and parameters, unlocks superior CAD generation compared to continuous diffusion baselines.

Honghu Pan, Xiaoling Luo, Yongyong Chen +2

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Computer Vision

2w ago·also Naturalis Biodiversity Center, Vrije Universiteit

Exploring Clustering Capability of Inpainting Model Embeddings for Pattern-based Individual Identification

For more reliable animal identification, force your model to reconstruct masked skin patterns, and it will learn embeddings that better capture individual differences.

Jens van Bijsterveld, Daniele Avitabile, Fons J. Verbeek +1

3D Ultrasound-Derived Pseudo-CT Synthesis Using a Transformer-Augmented Residual Network for Real-Time Operator Guidance

Sapna Sachan +1·2w ago

Generate CT-like images from ultrasound with a transformer-augmented network, potentially reducing the need for harmful radiation exposure.

Sapna Sachan, Amulya Kumar Mahto

Architecture Design (Transformers, SSMs, MoE) Computer Vision Scientific Discovery & Drug Design

Jinzhen Han +3·2w ago

Morphology-Guided Cross-Task Coupling for Joint Building Height and Footprint Estimation

Encoding cross-task relationships between building footprints and heights slashes height estimation error by 7% – more effective than just refining individual encoders.

Jinzhen Han, JinByeong Lee, Jisung Kim +1

Anny-Fit: All-Age Human Mesh Recovery

Laura Bravo-Sánchez +5·2w ago

Adult-trained human mesh recovery models can now handle kids, too, thanks to a new framework that enforces spatial consistency and leverages VLM-derived age and gender cues.

Laura Bravo-Sánchez, M. Armando, Romain Brégier +3

Computer Vision Multimodal Models Robotics & Embodied AI

2w ago

Contact Matrix: Enhancing Dance Motion Synthesis with Precise Interaction Modeling

Synthesizing realistic duet dance motions gets a boost from explicitly modeling inter-dancer contact, leading to significantly improved interaction fidelity and rhythmic synchronization.

Xuhai Chen, Zhi Cen, Huaijin Pi +3

CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

Qiming Li +11·2w ago

Steer LVLMs' attention with caption guidance and watch object hallucinations drop by 6%—no training required.

Qiming Li, Zekai Ye, Xiaocheng Feng +9

GTF: Omnidirectional EPI Transformer for Light Field Super-Resolution

Kunyu Li +4·2w ago

Overlooked diagonal epipolar geometry holds the key to boosting light field super-resolution, as demonstrated by a new omnidirectional EPI Transformer.

Kunyu Li, Fei Wang, Lichao Zhang +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision

Boyue Xu +3·2w ago

VL-UniTrack: A Unified Framework with Visual-Language Prompts for UAV-Ground Visual Tracking

Bridging the gap between aerial and ground-level tracking, VL-UniTrack uses visual-language prompts to achieve robust object tracking even with significant viewpoint differences.

Boyue Xu, Ruichao Hou, Tongwei Ren +1

Computer Vision Multimodal Models Robotics & Embodied AI

Liang Yao +8·2w ago

RemoteZero: Geospatial Reasoning with Zero Human Annotations

Unleashing geospatial reasoning on a torrent of unlabeled remote sensing data, RemoteZero rivals supervised methods by having models verify their own reasoning, not relying on human-annotated coordinates.

Liang Yao, Fan Liu, Shengxiang Xu +6

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

C. Gentil +3·2w ago

Dr-PoGO: Direct Radar Pose-Graph Optimization

Radar SLAM can now achieve state-of-the-art performance via direct scan registration, eliminating the need for hand-engineered feature extraction and enabling robust localization in adverse weather.

C. Gentil, Weican Li, L. Brizi +1

Autonomous Laparoscope Control through Unified Mechanics-Based Representation of Multimodal Intraoperative Information

Xiaojian Li +8·2w ago

Achieve autonomous laparoscope control by translating multimodal surgical data into a single "wrench" that guides the robot's movements.

Xiaojian Li, Jin Fang, Yudong Shi +6

Computer Vision Multimodal Models Robotics & Embodied AI

Independent·2w ago

AllSERP: Exhaustive Per-Element Enrichment of the Versatile AdSERP Dataset

Fine-grained analysis of user behavior on search engine results pages is now possible thanks to AllSERP, which adds exhaustive per-element annotations to the AdSERP dataset, covering organic results and widgets in addition to ads.

K. Andrew Edmonds

Computer Vision Data Curation & Synthetic Data Recommendation & Information Retrieval

Yanjia Chen +6·2w ago

Optimal Uncertainty-Aware Calibration for the AX=YB Problem

Hand-eye calibration gets a 67% accuracy boost in high-uncertainty scenarios thanks to a new optimization framework that cleverly avoids explicit uncertainty modeling.

Yanjia Chen, Xiangfei Li, Huan Zhao +4

Computer Vision Robotics & Embodied AI Training Efficiency & Optimization

2w ago

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

Robotic manipulation gets a serious upgrade: ConsisVLA-4D boosts performance by up to 41.5% and speeds up inference by 2.4x, all while ensuring your robot understands the scene in 3D *and* how it changes over time.

Wei Li, Jizhihui Liu, Li Yixing +3

Computer Vision Multimodal Models Robotics & Embodied AI

Jieying Wang +3·2w ago

Optimize-at-Capture: Highly-adaptive Exposure Controlling for In-Vehicle Non-contact Heart-rate Monitoring

Standard camera auto-exposure is blind to the needs of remote heart-rate monitoring, but a new method closes the gap to enable robust in-vehicle driver monitoring.

Jieying Wang, Xinqi Cai, Caifeng Shan +1

Detecting Deepfakes via Hamiltonian Dynamics

Harry Cheng +5·2w ago

Escaping the endless cat-and-mouse game of deepfake detection may be possible by shifting from static pattern recognition to physics-inspired dynamical stability analysis, where real images are stable and deepfakes are not.

Harry Cheng, Ming-Hui Liu, Tianyi Wang +3

Computer Vision Red-Teaming & Adversarial Robustness

2w ago·also HIT

LEGO: LoRA-Enabled Generator-Oriented Framework for Synthetic Image Detection

LEGO's modular design lets you detect deepfakes with 10x less training data and far fewer epochs, all by focusing on the unique fingerprints of each image generator.

Ran Ran, Jiwei Wei, Shuchang Zhou +2

Computer Vision Data Curation & Synthetic Data Red-Teaming & Adversarial Robustness

Yupeng Gao +3·2w ago

UAV as Urban Construction Change Monitor: A New Benchmark and Change Captioning Model

Achieve spatially grounded natural language descriptions of urban development with PTNet, a new model that understands change semantics better than existing methods.

Yupeng Gao, Tianyu Li, Guoqing Wang +1

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

2w ago

Structured 3D Latents Are Surprisingly Powerful: Unleashing Generalizable Style with 2D Diffusion

Forget training from scratch: surprisingly, off-the-shelf 2D diffusion models can unlock generalizable style control in 3D generation models, even for out-of-distribution styles.

Yiran Qiao, Yiren Lu, Yunlai Zhou +5

Ground4D: Spatially-Grounded Feedforward 4D Reconstruction for Unstructured Off-Road Scenes

2w ago·also BIT, XJTU

By grounding temporal Gaussian aggregation in spatial voxels, Ground4D achieves state-of-the-art 4D reconstruction in challenging off-road environments where existing methods falter.

Shuo Wang, Jilin Mei, Fuyang Liu +6

Architecture Design (Transformers, SSMs, MoE) Computer Vision Robotics & Embodied AI

2w ago·also Key Lab of MIMS, Northwestern, School of Computer Science and Engineering

A cross-modal network for facial expression recognition

Face symmetry and half-face alignment can be combined to achieve state-of-the-art facial expression recognition.

Chunwei Tian, Jingyuan Xie, Qi Zhang +3

Information Coordination as a Bridge: A Neuro-Symbolic Architecture for Reliable Autonomous Driving Scene Understanding

Shuo Liu +5·2w ago

Stop feeding LLMs redundant and conflicting sensor data in autonomous driving: a new architecture slashes hallucinated entities by coordinating multi-sensor inputs *before* reasoning.

Shuo Liu, Lei Shi, Haowen Liu +3

Computer Vision Multimodal Models Robotics & Embodied AI

Jiaming Hu +4·2w ago

Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

Ditch the Bradley-Terry model: a game-theoretic approach to diffusion alignment unlocks better text-to-image generation by directly optimizing for Nash equilibrium in human preferences.

Jiaming Hu, Jiamu Bai, Haoyu Wang +2

Computer Vision Multimodal Models RLHF & Preference Learning

ZhiXin Sun·2w ago

Example-Based Object Detection

Stop retraining your object detector every time it makes a mistake: EBOD learns from failure examples to prevent recurring errors in open-vocabulary object detection.

ZhiXin Sun

Computer Vision Natural Language Processing

Nand Kumar Mishra +4·2w ago

DALight-3D: A Lightweight 3D U-Net for Brain Tumor Segmentation from Multi-Modal MRI

Brain tumor segmentation gets a lightweight boost: DALight-3D achieves comparable accuracy to larger U-Nets with significantly fewer parameters.

Nand Kumar Mishra, Nandkishore Mishra, Dhruv Mishra +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision Scientific Discovery & Drug Design

DAMO·2w ago

High-Fidelity Single-Image Head Modeling with Industry-Grade Topology

Technical artists overwhelmingly prefer this new method for single-image head mesh reconstruction, finding it closest to industry-grade usability.

Computer Vision

Anagh Malik +5·2w ago

Velox: Learning Representations of 4D Geometry and Appearance

Unlock efficient 4D object understanding from dynamic point clouds with Velox, a representation that's descriptive, compressive, and accessible.

Anagh Malik, Dorian Chan, Xiaoming Zhao +3

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Lihua Zhou +8·2w ago

Reward-Guided Semantic Evolution for Test-time Adaptive Object Detection

Forget training, just nudge your text embeddings: RGSE closes the open-vocabulary object detection gap under distribution shift by directly and efficiently adapting text embeddings at test time.

Lihua Zhou, Mao Ye, Xiatian Zhu +6

Angle-I2P: Angle-Consistent-Aware Hierarchical Attention for Cross-Modality Outlier Rejection

Muyao Peng +4·2w ago

Even with noisy initial matches, Angle-I2P leverages angular consistency and hierarchical attention to achieve state-of-the-art image-to-point cloud registration.

Muyao Peng, Shun Zou, Pei An +2

Computer Vision Multimodal Models Robotics & Embodied AI

Kaili Zheng +4·2w ago

InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery

Explicitly modeling human-object interactions boosts multi-person human mesh recovery accuracy by up to 9.9%, showing that interaction context is key to understanding human pose and shape in complex scenes.

Kaili Zheng, Kaiwen Wang, Xun Zhu +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision Robotics & Embodied AI

2w ago

SAMIC: A Lightweight Semantic-Aware Mamba for Efficient Perceptual Image Compression

Mamba's linear complexity meets perceptual image compression, yielding a lightweight model that rivals GANs and diffusion models in visual quality while being far more efficient.

Jiaqian Zhang, Hao Wei, Chenyang Ge +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

Zhiwei Yang +5·2w ago·also CAS

DiCLIP: Diffusion Model Enhances CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation

By fusing CLIP with a diffusion model, DiCLIP unlocks surprisingly strong weakly supervised segmentation, outperforming prior methods and slashing training costs.

Zhiwei Yang, Pengfei Song, Yucong Meng +3

Advancing Aesthetic Image Generation via Composition Transfer

Kai Zou +3·2w ago

Stop letting semantics dictate composition: Composer unlocks semantic-agnostic control over image aesthetics, letting you transfer and plan compositions with unprecedented precision.

Kai Zou, Zhiwei Zhao, Bin Liu +1

UniPCB: A Generation-Assisted Detection Framework for PCB Defect Inspection

Huan Zhang +6·2w ago

Generating synthetic training data with multi-modal diffusion beats hand-crafting better detection architectures for PCB defect inspection.

Huan Zhang, Lianghong Tan, Yichu Xu +4

Architecture Design (Transformers, SSMs, MoE) Computer Vision Data Curation & Synthetic Data

Tsinghua AI·2w ago·also BAIR

Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern

Adversarial clothing with non-overlapping visible-thermal patterns can reliably evade RGB-T detectors, even transferring across different fusion architectures.

Xiaopei Zhu, Guanning Zeng, Zhanhao Hu +2

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

University of Campinas·2w ago

Attention-Based Chaotic Self-Supervision for Medical Image Classification

Random masking in self-supervised learning can destroy crucial diagnostic features in medical images; instead, try inverting chaos.

Joao Batista Florindo, Amanda Pontes de Oliveira Ornelas

Architecture Design (Transformers, SSMs, MoE) Computer Vision Scientific Discovery & Drug Design

Yihan Lin +6·2w ago

From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

Image-based latent actions are your secret weapon for long-horizon reasoning in VLAs, while action-based latent actions unlock complex motor coordination.

Yihan Lin, Haoyang Li, Yang Li +4

Computer Vision Multimodal Models Robotics & Embodied AI

Keunho Byeon +1·2w ago

HEXST: Hexagonal Shifted-Window Transformer for Spatial Transcriptomics Gene Expression Prediction

Spatial transcriptomics predictions get a boost from HEXST, a Transformer that respects the hexagonal geometry of spot arrays and recovers gene-specific spatial heterogeneity.

Keunho Byeon, Jin Tae Kwak

Architecture Design (Transformers, SSMs, MoE) Computer Vision Scientific Discovery & Drug Design

2w ago·also The Hunan Engineering Research Center of

ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting

Alpha-blending, a core optimization in 3D Gaussian Splatting, subtly hobbles feature learning, but a geometry-weighted fusion approach can unlock more accurate and efficient visual localization.

Yingdong Gu, Shaocheng Yan, Zhenjun Zhao +4

Lightweight Cross-Spectral Face Recognition via Contrastive Alignment and Distillation

Anjith George +1·2w ago

Resource-strapped edge devices can now achieve state-of-the-art face recognition across different sensing modalities thanks to a new lightweight CNN-Transformer architecture.

Anjith George, Sébastien Marcel

Computer Vision Inference & Quantization

Andranik Sargsyan +1·2w ago

FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching

FlowDIS achieves state-of-the-art dichotomous image segmentation by using flow matching, even allowing for precise, pixel-level control via text prompts.

Andranik Sargsyan, Shant Navasardyan

QuadBox: Accelerating 3D Gaussian Splatting with Geometry-Aware Boxes

Xinze Li +6·2w ago

3D Gaussian Splatting gets a nearly 2x speed boost thanks to a clever bounding box strategy that drastically reduces unnecessary tile intersection checks.

Xinze Li, Bohan Yang, Pengxu Chen +4

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

IMT Nord Europe·2w ago·also Explain, University of Lille

ICPR 2026 Competition on Privacy-Preserving Person Re-Identification from Top-View RGB-Depth Camera (TVRID)

Top-view RGB-D person re-identification is surprisingly feasible, even across modalities, despite the inherent challenges of viewpoint and modality variations.

Raphaël Delécluse, Hazem Wannous, Laurent Guimas

Computer Vision Data Curation & Synthetic Data Robotics & Embodied AI

2w ago

Low-Rank Adaptation of Geospatial Foundation Models for Wildfire Mapping Using Sentinel-2 Data

Forget full fine-tuning: LoRA lets you adapt Geospatial Foundation Models for wildfire mapping with comparable accuracy while only tweaking 1% of the parameters.

Ali Shibli, Andrea Nascetti, Yifang Ban

Computer Vision Open-Source Models & Weights Training Efficiency & Optimization
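
For a sense of how a low-rank adapter can end up training only about 1% of the parameters, here is a back-of-the-envelope sketch for a single dense layer. The layer size (768 x 768) and rank (4) are assumptions picked for illustration, not figures from the paper, and `lora_param_fraction` is a hypothetical helper name. It relies only on the standard LoRA construction, where a frozen weight of shape d_out x d_in is adapted by two trainable factors of shape d_out x r and r x d_in.

```python
def lora_param_fraction(d_in, d_out, rank):
    """Trainable LoRA parameters relative to the frozen dense weight they adapt."""
    full = d_in * d_out            # frozen weight W, shape d_out x d_in
    lora = rank * (d_in + d_out)   # trainable factors A (rank x d_in) and B (d_out x rank)
    return lora / full

# Assumed layer size and rank, chosen only for illustration.
fraction = lora_param_fraction(d_in=768, d_out=768, rank=4)
print(f"{fraction:.2%}")
```

The ratio for a whole model also depends on which layers receive adapters, so a per-layer figure like this is only a rough guide.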

Joao B Florindo·2w ago

Chaotic Contrastive Learning for Robust Texture Classification

Forget ImageNet – pre-training with chaotic augmentations yields surprisingly robust texture features, outperforming SOTA methods across diverse texture datasets.

Joao B Florindo

Architecture Design (Transformers, SSMs, MoE) Computer Vision Training Efficiency & Optimization

Phenikaa University·2w ago

ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

ScriptHOI reveals that current HOI detectors over-rely on object affordance and phrase co-occurrence, and proposes a novel approach to explicitly model interaction scripts for improved open-vocabulary generalization.

Minh Anh Nguyen, Quang Huy Tran, Bao Ngoc Le +3

A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping

AeroVironment·2w ago·also George Mason University

Existing restoration methods crumble when faced with the extreme geometric distortions caused by strong refractive warping, highlighting the need for robust new approaches benchmarked on this challenging dataset.

Maxim V. Shugaev, Md Reshad Ul Hoque, Bridget Kennedy +8

Computer Vision Eval Frameworks & Benchmarks

2w ago·also Stony Brook, University of Hawai'i at Mānoa, University of Hawai'i Cancer Center

External Validation of Deep Learning Models for BI-RADS Breast Density Prediction from Ultrasound Images

Turns out, deep learning models trained to predict breast density from ultrasound images generalize surprisingly well to external datasets, but still struggle with heterogeneously dense breasts.

Yuxuan Chen, Arianna Bunnell, Yanqi Xu +4