Computer Vision
Applications
Image recognition, object detection, segmentation, video understanding, and visual generation.
Recent Papers
This paper investigates the impact of data imbalance on deep learning-based software vulnerability detection using nine open-source datasets and two state-of-the-art DL models. The study confirms that data imbalance significantly affects model performance and that existing imbalance solutions exhibit varying effectiveness across datasets and evaluation metrics. The authors find that focal loss improves precision, mean false error and class-balanced loss improve recall, and random over-sampling improves F1-measure, but no single solution excels across all metrics.
Empirically demonstrates the significant impact of data imbalance on deep learning models for software vulnerability detection and evaluates the effectiveness of existing imbalance solutions across multiple datasets and metrics.
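As a quick reference for one of the remedies compared above, here is a minimal PyTorch sketch of the standard binary focal loss; the gamma and alpha values are illustrative defaults, not the paper's settings.

import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Down-weight easy examples so the minority (vulnerable) class
    # contributes more to the gradient than under plain cross-entropy.
    targets = targets.float()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                                   # prob. of the true class
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()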
This paper introduces CSEval, a framework for evaluating the clinical semantic alignment between text prompts and generated medical images, addressing the limitations of existing metrics focused on realism and diversity. CSEval uses language models to identify semantic inconsistencies related to anatomical location and pathology, demonstrating a correlation with expert clinical judgment. The framework offers a scalable method for assessing the clinical reliability of generated medical images, crucial for the safe deployment of text-to-image models in healthcare.
Introduces CSEval, a novel language model-based framework, to evaluate the clinical semantic alignment between text prompts and generated medical images.
This paper introduces a calibrated Bayesian deep learning framework for medical imaging decision support, addressing the critical need for reliable uncertainty quantification in AI-assisted diagnostics. The framework combines a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) during training, which penalizes high-confidence errors and low-confidence correct predictions, with a post-hoc Dual Temperature Scaling (DTS) strategy to refine the posterior distribution. Validated on pneumonia screening, diabetic retinopathy detection, and skin lesion identification, the approach demonstrates improved calibration, robust performance in data-scarce scenarios, and effectiveness on imbalanced datasets.
Introduces a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) and Dual Temperature Scaling (DTS) strategy to improve calibration and uncertainty quantification in Bayesian deep learning models for medical imaging.
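For background, a minimal sketch of the single-temperature scaling that DTS builds on, fitted on held-out validation logits; DTS itself uses two temperatures and is not reproduced here.

import torch
from torch import nn, optim

def fit_temperature(val_logits, val_labels):
    # Classic post-hoc temperature scaling: learn one scalar T on a
    # validation split so that softmax(logits / T) is better calibrated.
    T = nn.Parameter(torch.ones(1))
    opt = optim.LBFGS([T], lr=0.01, max_iter=100)
    nll = nn.CrossEntropyLoss()

    def closure():
        opt.zero_grad()
        loss = nll(val_logits / T.clamp(min=1e-3), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return T.item()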
The paper introduces KAN-FIF, a lightweight neural network architecture leveraging Kolmogorov-Arnold Networks (KANs) with spline parameterization to estimate tropical cyclone intensity from meteorological satellite data. KAN-FIF addresses the limitations of existing physics-guided models, which suffer from high parameter counts and computational inefficiency due to their inability to capture complex feature interactions. Experiments demonstrate that KAN-FIF achieves superior accuracy with significantly reduced parameters and faster inference speed compared to baseline models like Phy-CoCo, making it suitable for deployment on resource-constrained edge devices.
Introduces KAN-FIF, a novel and lightweight neural network architecture for tropical cyclone intensity estimation that integrates spline-parameterized KAN layers to efficiently capture complex feature interactions.
The paper introduces DynaHOI-Gym, a new online closed-loop platform for benchmarking hand motion generation in dynamic hand-object interaction (HOI) scenarios, addressing the limitations of existing benchmarks focused on static objects. To facilitate research, the authors release DynaHOI-10M, a large-scale dataset comprising 10 million frames and 180K hand capture trajectories with diverse target motions. They also present an observe-before-act (ObAct) baseline that leverages spatiotemporal attention, demonstrating improved location success rates in the dynamic HOI setting.
Introduces DynaHOI-Gym and DynaHOI-10M, a novel benchmark and dataset for evaluating hand motion generation in dynamic hand-object interaction scenarios.
This paper introduces PuYun-LDM, a latent diffusion model for high-resolution ensemble weather forecasting that addresses the limited diffusability of LDMs in this domain. To improve diffusability, the authors incorporate weather-state evolution features encoded by a 3D Masked AutoEncoder (3D-MAE) as additional conditioning. They also propose a Variable-Aware Masked Frequency Modeling (VA-MFM) strategy to adaptively regularize the spectral energy distribution of each variable, leading to improved performance compared to ENS at short lead times.
Introduces a novel latent diffusion model, PuYun-LDM, incorporating 3D-MAE conditioning and Variable-Aware Masked Frequency Modeling to enhance diffusability and improve high-resolution ensemble weather forecasting.
The paper introduces PosterOmni, a framework for generalized artistic poster creation that tackles both local image editing and global design creation aspects of the task. It achieves this by constructing a multi-task dataset, distilling knowledge from local and global expert models, and applying a unified reward feedback mechanism to align visual fidelity and aesthetic preferences. Experiments on the new PosterOmni-Bench demonstrate that PosterOmni outperforms existing open-source and proprietary systems in reference adherence, composition, and aesthetics.
Introduces a novel data-distillation-reward pipeline to unify local image editing and global design creation for generalized artistic poster generation.
The paper introduces TexSpot, a diffusion-based texture enhancement framework that addresses view-inconsistency and resolution limitations in 3D texture generation. TexSpot utilizes a novel 3D texture representation called Texlet, which combines point-based and UV-based approaches by encoding local texture patches with a 2D encoder and aggregating them with a 3D encoder. Experiments show that TexSpot significantly improves visual fidelity, geometric consistency, and robustness compared to existing state-of-the-art methods.
Introduces Texlet, a novel 3D texture representation that merges the geometric expressiveness of point-based 3D textures with the compactness of UV-based representations.
The paper introduces OMEGA-Avatar, a novel feed-forward framework for generating animatable, 360-degree complete 3D Gaussian head avatars from a single image. To achieve this, the method incorporates a semantic-aware mesh deformation module for improved hair modeling and a multi-view feature splatting module to construct a shared canonical UV representation. Experiments demonstrate that OMEGA-Avatar outperforms existing methods in 360-degree full-head completeness and identity preservation.
Introduces a feed-forward framework, OMEGA-Avatar, that generates generalizable, 360-degree complete, and animatable 3D Gaussian head avatars from a single image by combining semantic-aware mesh deformation and multi-view feature splatting.
The paper introduces a supervise-assisted multi-modality fusion diffusion model (MFdiff) to restore standard-dose PET (SPET) images from low-dose PET (LPET) and MR images. MFdiff uses a multi-modality feature fusion module to learn optimized fusion features from MR images and incorporates these features as additional conditions in a diffusion model for iterative SPET image generation. A two-stage supervise-assisted learning strategy leverages both generalized priors from simulated data and specific priors from in-vivo data to improve restoration quality, demonstrating superior performance compared to existing methods.
Introduces a novel supervise-assisted multi-modality fusion diffusion model (MFdiff) that effectively leverages MR images to restore high-quality SPET images from LPET data by using a two-stage training approach.
This paper introduces a novel steganalysis method for H.265/HEVC video that focuses on the coding unit (CU) block structure, addressing the limitations of existing methods that primarily analyze motion vectors, intra prediction modes, or transform coefficients. The method constructs a CU block-structure gradient map to capture changes in coding-unit partitioning and combines it with a block-level mapping representation of intra prediction modes to model steganographic perturbations. A tailored Transformer network, GradIPMFormer, is designed to enhance the perception of CU-level steganographic behaviors, demonstrating superior detection performance across multiple H.265/HEVC steganographic algorithms.
Introduces a CU block-level steganalysis method for H.265/HEVC video by constructing a CU block-structure gradient map and combining it with a block-level mapping representation of intra prediction modes, then training a custom Transformer network.
The paper introduces Spatial Chain-of-Thought (SCoT), a framework that combines the spatial reasoning of Multimodal Large Language Models (MLLMs) with the generative capabilities of diffusion models for improved image generation. SCoT trains a diffusion model on interleaved text-coordinate instructions to enhance layout awareness and uses MLLMs as planners to generate detailed layout plans. Experiments show SCoT achieves state-of-the-art performance on image generation benchmarks and excels in complex reasoning and image editing tasks.
Introduces Spatial Chain-of-Thought (SCoT), a novel plug-and-play framework that bridges MLLM reasoning and diffusion model generation by training the diffusion model with interleaved text-coordinate instructions and using MLLMs for spatial planning.
This paper addresses the instability issues in Rectified Flow (RF) inversion, which arise from accumulated approximation errors during the inversion process. The authors introduce Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it towards a running average of past velocities within a theoretically motivated spherical Gaussian constraint. They further propose mimic-CFG, a velocity correction scheme for editing tasks that interpolates between the current velocity and its projection onto the historical average.
Introduces Proximal-Mean Inversion (PMI) and mimic-CFG, two novel, training-free methods to stabilize Rectified Flow inversion and improve image reconstruction and editing fidelity.
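A toy sketch of the underlying intuition only: pulling the current velocity toward the running mean of past velocities. PMI's actual correction is derived from a proximal objective under a spherical Gaussian constraint; the fixed beta weight below is invented for illustration.

import torch

def blended_velocity(v_t, past_velocities, beta=0.5):
    # Illustrative only: blend the current velocity with the running mean
    # of past velocities accumulated along the inversion trajectory.
    if not past_velocities:
        return v_t
    v_bar = torch.stack(past_velocities).mean(dim=0)
    return (1.0 - beta) * v_t + beta * v_bar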
The paper investigates the use of 3D fractals generated via Iterated Function Systems (IFS) as a synthetic pre-training dataset for action recognition models. It identifies limitations in standard fractal generation methods, including slow speed and degenerate fractal structures, and finds that overly restrictive filtering hurts downstream performance. The authors introduce Targeted Smart Filtering, a novel method that significantly accelerates fractal generation (100x speedup) while maintaining fractal diversity, leading to improved action recognition performance after pre-training.
Introduces Targeted Smart Filtering, a novel method for generating high-quality 3D fractals for action recognition pre-training that balances generation speed and fractal diversity.
The paper introduces DreamID-Omni, a unified framework for human-centric audio-video generation, addressing tasks like reference-based generation, video editing, and audio-driven animation within a single model. It tackles the challenge of disentangling character identities and voice timbres by employing a Dual-Level Disentanglement strategy and a Symmetric Conditional Diffusion Transformer. Experimental results demonstrate state-of-the-art performance in video, audio, and audio-visual consistency, surpassing even proprietary commercial models.
Introduces a unified framework, DreamID-Omni, that achieves state-of-the-art performance on a range of human-centric audio-video generation tasks by disentangling identity and timbre control.
This paper investigates the use of local vision-language models (VLMs) to improve fine-grained activity recognition in newborn resuscitation videos, comparing them to a TimeSformer baseline. The authors explored zero-shot VLM strategies and fine-tuned VLMs with LoRA on a simulated dataset of 13.26 hours of video. Fine-tuning a local VLM with LoRA achieved an F1 score of 0.91, outperforming the TimeSformer baseline (0.70), suggesting the potential of VLMs for this task.
Demonstrates that fine-tuning local vision-language models with LoRA can significantly improve activity recognition in newborn resuscitation videos compared to a TimeSformer baseline.
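For readers unfamiliar with the setup, a hedged sketch of LoRA fine-tuning of a local VLM via the peft library; the checkpoint name, rank, and target modules below are assumptions, not the paper's exact configuration.

from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Illustrative LoRA setup: only low-rank adapters on selected projection
# layers are trained, keeping the base VLM weights frozen.
model = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()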
This paper introduces the Task-Amortized Variational Autoencoder (TAVAE), a generative model of V1 activity, to investigate how task-specific priors are learned and deployed in the visual cortex. TAVAE extends the VAE framework to efficiently acquire new tasks by reusing previously learned representations, allowing for flexible adaptation of priors. By comparing TAVAE's posterior distributions with large-scale V1 recordings from mice performing a discrimination task, the study demonstrates that the visual system can rapidly learn and utilize task-specific contextual priors, reflected in bimodal response profiles when task statistics are violated.
Introduces the Task-Amortized Variational Autoencoder (TAVAE), a novel VAE architecture that enables efficient learning of task-specific priors by amortizing learning across tasks.
This paper presents an anatomical analysis of text prompting within vision-language segmentation models, specifically SAM3, revealing significant redundancy in text encoder utilization. Based on these findings, they propose SAM3-LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student. Experiments demonstrate that SAM3-LiteText reduces text encoder parameters by up to 88% while maintaining segmentation performance on image and video segmentation benchmarks.
Introduces SAM3-LiteText, a distilled MobileCLIP-based text encoder, to significantly reduce the computational and memory overhead of SAM3's text encoder without sacrificing segmentation accuracy.
This paper introduces HyperDet, a radar-only 3D object detection framework that enhances raw radar data to be more compatible with LiDAR-oriented detectors. HyperDet aggregates multi-frame, multi-radar data, applies geometry-aware cross-sensor validation, and uses a foreground-focused diffusion module trained with mixed radar-LiDAR supervision to densify object structures and lift radar attributes. Experiments on the MAN TruckScenes dataset demonstrate that HyperDet improves performance with VoxelNeXt and CenterPoint, reducing the gap between radar-only and LiDAR-based detection.
Proposes HyperDet, a novel radar-only 3D detection framework that constructs a task-aware hyper 4D radar point cloud to improve performance with standard LiDAR-oriented detectors.
The paper introduces PLESS, a pseudo-label enhancement strategy for weakly supervised segmentation using scribble annotations, addressing the limitations of noisy and incomplete supervision. PLESS leverages a hierarchical partitioning of the image into spatially coherent regions to propagate scribble information and refine pseudo-labels within these regions. Experiments on cardiac MRI datasets demonstrate that PLESS consistently improves segmentation accuracy across different scribble-supervised algorithms.
Introduces a novel pseudo-label enhancement strategy, PLESS, that leverages hierarchical image partitioning to improve the reliability and spatial consistency of pseudo-labels in weakly supervised segmentation.
This paper introduces a decentralized multi-robot system for detecting and tracking floating containers in maritime environments, using a team of UAVs and an autonomous surface vessel. The system employs YOLOv8 and stereo disparity for visual detection on each UAV, followed by per-object Extended Kalman Filters (EKFs) for tracking with uncertainty-aware data association. Track summaries are exchanged and fused using covariance intersection to maintain consistency, and an information-driven assignment module optimizes target allocation and UAV viewpoints.
Introduces a decentralized multi-robot perception framework that combines visual detection, EKF tracking with uncertainty-aware data association, conservative track fusion via covariance intersection, and information-driven task assignment for robust maritime object tracking.
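For reference, a minimal NumPy sketch of standard covariance intersection, the conservative fusion rule mentioned above; the trace-minimizing weight is found by a simple grid search here.

import numpy as np

def covariance_intersection(xa, Pa, xb, Pb, n_grid=99):
    # Fuse two track estimates with unknown cross-correlation: pick the
    # mixing weight w that minimizes the trace of the fused covariance.
    Pa_inv, Pb_inv = np.linalg.inv(Pa), np.linalg.inv(Pb)
    best_x, best_P = None, None
    for w in np.linspace(0.01, 0.99, n_grid):
        P = np.linalg.inv(w * Pa_inv + (1.0 - w) * Pb_inv)
        if best_P is None or np.trace(P) < np.trace(best_P):
            best_x = P @ (w * Pa_inv @ xa + (1.0 - w) * Pb_inv @ xb)
            best_P = P
    return best_x, best_P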
This paper introduces a deep learning approach to enhance social robot gaze behavior by incorporating both human and non-human stimuli, using LSTM and Transformer models trained on human gaze data collected via VR in simulated and real-world scenarios. The models predict human gaze direction with accuracies of up to 72% and 71.6% for the LSTM and Transformer, respectively, in real-world settings, outperforming existing methods by uniquely considering non-human stimuli. The system was deployed on a NAO robot and evaluated with 275 participants, demonstrating high user satisfaction.
Demonstrates a novel approach to predicting human gaze in social settings by integrating non-human stimuli and achieving state-of-the-art accuracy using LSTM and Transformer models.
This paper introduces ViTaS, a visuomotor learning framework that leverages both visual and tactile information through Soft Fusion Contrastive Learning and a CVAE module to improve performance in manipulation tasks, especially in occluded scenarios. The Soft Fusion Contrastive Learning method is designed to better exploit the alignment and complementarity of visual and tactile representations. Experiments across 12 simulated and 3 real-world environments demonstrate that ViTaS significantly outperforms existing baselines, highlighting the benefits of the proposed fusion and contrastive learning approach.
Introduces Soft Fusion Contrastive Learning to effectively fuse visual and tactile information for visuomotor tasks, improving performance in occluded scenarios by explicitly modeling the complementary nature of the two modalities.
The paper introduces DeepGen 1.0, a 5B parameter unified multimodal model for image generation and editing, designed to be lightweight and efficient compared to larger models. To enhance semantic understanding in the compact model, they propose Stacked Channel Bridging (SCB) to extract and fuse hierarchical features from VLMs with learnable 'think tokens'. They also employ a three-stage data-centric training strategy, including alignment pre-training, joint supervised fine-tuning, and reinforcement learning with MR-GRPO, achieving state-of-the-art performance on benchmarks like WISE and UniREditBench while using only 50M training samples.
Introduces Stacked Channel Bridging (SCB), a novel deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to improve the generative backbone's semantic understanding and fine-grained control.
The paper introduces A$^{2}$V-SLP, an alignment-aware variational framework for sign language production that learns disentangled latent distributions for each articulator. This approach uses a disentangled VAE to encode sign pose sequences and extract articulator-specific mean and variance vectors, which then serve as distributional supervision for a non-autoregressive Transformer that predicts latent means and log-variances from text embeddings. By employing stochastic sampling and a gloss attention mechanism, A$^{2}$V-SLP achieves state-of-the-art back-translation performance and enhances motion realism in gloss-free sign language production.
Introduces an alignment-aware variational framework (A$^{2}$V-SLP) that learns disentangled latent distributions for sign language production, improving back-translation performance and motion realism.
The paper introduces LUVE, a latent-cascaded framework for ultra-high-resolution (UHR) video generation that tackles challenges in motion modeling, semantic planning, and detail synthesis. LUVE uses a three-stage architecture: low-resolution motion generation, latent upsampling, and high-resolution content refinement with dual frequency experts. Experiments demonstrate that LUVE achieves superior photorealism and content fidelity in UHR video generation compared to existing methods.
Introduces a novel latent-cascaded architecture with dual-frequency experts for generating ultra-high-resolution videos, improving both photorealism and content fidelity.
The paper addresses object hallucination in Multimodal Large Language Models (MLLMs) by improving visual contrastive decoding (VCD) through the creation of an object-aligned auxiliary view. This auxiliary view is constructed by masking the most salient visual evidence based on object-centric attention from self-supervised Vision Transformers, thereby disrupting unsupported tokens during decoding. The proposed method, "Mask What Matters," is prompt-agnostic, model-agnostic, and computationally efficient, leading to improved performance on object hallucination benchmarks.
Introduces a novel object-aligned visual contrastive decoding method that masks salient visual features to mitigate object hallucinations in MLLMs.
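A minimal sketch of the generic visual-contrastive-decoding step that the method instantiates with its object-aligned masked view; alpha is an illustrative contrast strength, not a value from the paper.

import torch

def contrastive_decode_step(logits_full, logits_masked, alpha=1.0):
    # Tokens that remain likely even after the salient visual evidence is
    # masked are treated as unsupported and down-weighted.
    adjusted = (1.0 + alpha) * logits_full - alpha * logits_masked
    return torch.argmax(adjusted, dim=-1)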
This paper investigates gender and skin-tone biases in Gemini Flash 2.5 Image and GPT Image 1.5 by generating 3,200 images from semantically neutral prompts. Using a pipeline involving color normalization, facial landmark masking, and skin tone quantification via Monk, PERLA, and Fitzpatrick scales, the study reveals a "default white" bias in both models. Gemini favored female-presenting subjects, while GPT favored male-presenting subjects with lighter skin tones, demonstrating that neutral prompts elicit polarized demographic defaults.
Quantifies and compares gender and skin-tone biases in Gemini Flash 2.5 Image and GPT Image 1.5 using a rigorous colorimetric methodology.
The paper introduces PathCRF, a novel framework for detecting on-ball soccer events using only player tracking data by inferring possession paths. They model player trajectories as a fully connected dynamic graph and use a Conditional Random Field (CRF) to ensure logical consistency in the inferred possession sequence. Experiments demonstrate that PathCRF accurately detects possession paths and events, reducing the need for manual annotation.
Introduces a ball-free soccer event detection framework, PathCRF, that infers possession paths from player trajectories using a CRF to enforce logical consistency.
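As an illustration of the decoding step, a standard Viterbi pass over a linear chain of per-frame possession scores; the score definitions are generic placeholders, not PathCRF's learned potentials.

import numpy as np

def possession_viterbi(emissions, transitions):
    # emissions[t, p]: score of player p holding the ball at frame t.
    # transitions[p, q]: score of the ball moving from player p to player q.
    T, P = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, P), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions          # shape (P, P)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]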
This paper introduces a universal diffusion-based downscaling framework that converts low-resolution weather forecasts into high-resolution probabilistic predictions without model-specific fine-tuning. A conditional diffusion model is trained on coarse-resolution inputs and high-resolution reanalysis targets and then applied in a zero-shot manner to deterministic forecasts from various weather models. The downscaled forecasts consistently improve upon the raw deterministic forecasts, with significant gains in probabilistic skill (CRPS) when evaluated against independent station observations.
Demonstrates a scalable, model-agnostic probabilistic interface for enhancing spatial resolution and uncertainty representation in operational weather forecasting pipelines via diffusion-based downscaling.
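For reference, the empirical ensemble CRPS used to measure probabilistic skill can be computed per grid point as below; this is the standard estimator, independent of the paper's implementation.

import numpy as np

def ensemble_crps(members, obs):
    # Ensemble error against the observation minus half the mean pairwise
    # spread among ensemble members.
    m = np.asarray(members, dtype=float)
    return np.mean(np.abs(m - obs)) - 0.5 * np.mean(np.abs(m[:, None] - m[None, :]))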
This paper explores test-time verification as a method to improve vision-language-action (VLA) alignment, addressing the "intention-action gap" in embodied instruction following. They demonstrate that scaling both rephrased instructions and generated actions at test time enhances sample diversity and improves action selection. The authors introduce CoVer, a contrastive verifier, and a hierarchical verification inference pipeline, showing that this verification approach outperforms scaling policy pre-training on the SIMPLER and PolaRiS benchmarks.
Demonstrates that scaling test-time verification, through diverse instruction rephrasing and action candidate generation, is more effective than scaling policy pre-training for vision-language-action alignment.
This paper introduces an energy-aware spike budgeting framework for continual learning in spiking neural networks (SNNs) to address catastrophic forgetting while optimizing for energy efficiency. The framework combines experience replay, learnable LIF neuron parameters, and an adaptive spike scheduler to enforce dataset-specific energy constraints during training. Results show that spike budgeting acts as a sparsity-inducing regularizer on frame-based datasets, improving accuracy and reducing spike rates, while controlled budget relaxation enables accuracy gains on event-based datasets.
Introduces an energy-aware spike budgeting framework that adaptively controls spike rates during continual learning in SNNs to improve both accuracy and energy efficiency across frame-based and event-based neuromorphic vision datasets.
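A minimal sketch of what a spike-budget regularizer can look like; the paper's adaptive scheduler adjusts the budget during training, and the penalty weight below is invented.

import torch

def spike_budget_penalty(spike_counts, budget, lam=1.0):
    # Penalize only the average spike rate in excess of a dataset-specific
    # budget, acting as a sparsity-inducing term on network activity.
    rate = spike_counts.float().mean()
    return lam * torch.relu(rate - budget)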
This paper introduces Electrostatics-Inspired Surface Reconstruction (EISR), a novel method for 3D surface reconstruction that represents shapes as solutions to Poisson's equation. By drawing an analogy to electrostatics and utilizing Green's functions, the method derives a closed-form parametric expression for the implicit field. The key result is improved reconstruction of high-frequency details compared to existing SDF-based methods, even with limited shape priors, by leveraging the superposition principle of Poisson's equation solutions.
Formulates 3D surface reconstruction as solving Poisson's equation using Green's functions and superposition, enabling improved high-frequency detail recovery.
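For context, the standard electrostatics identities the analogy rests on: the free-space Green's function of the 3D Laplacian and the superposition of point-source potentials. The source strengths q_i and locations x_i are generic symbols, not the paper's parameterization.

\[
\nabla^{2}\phi(\mathbf{x}) = -\rho(\mathbf{x}), \qquad
G(\mathbf{x},\mathbf{x}') = \frac{1}{4\pi\,\lVert \mathbf{x}-\mathbf{x}' \rVert}, \qquad
\phi(\mathbf{x}) = \sum_{i} q_{i}\, G(\mathbf{x},\mathbf{x}_{i}).
\]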
The paper introduces EmoSpace, a framework for emotion-aware content generation that learns dynamic emotion prototypes via vision-language alignment to enable fine-grained emotional control in VR content creation. EmoSpace uses a hierarchical emotion representation with learnable prototypes that evolve during training, allowing for control without explicit emotion labels. Experiments demonstrate EmoSpace's superior performance in emotional image outpainting, stylized generation, and emotional panorama generation, further validated by a user study comparing emotional perception in VR versus desktop environments.
Introduces a novel emotion-aware content generation framework, EmoSpace, that learns dynamic, interpretable emotion prototypes through vision-language alignment.
This paper introduces ImagineAgent, a framework that uses cognitive reasoning and generative imagination to improve Open-Vocabulary Human-Object Interaction (OV-HOI) comprehension. ImagineAgent constructs cognitive maps to model relationships between entities and actions, and uses retrieval augmentation, image cropping, and diffusion models to gather knowledge and visual evidence. Experiments on SWIG-HOI and HICO-DET show state-of-the-art performance with significantly less training data.
Introduces ImagineAgent, a novel agentic framework that leverages cognitive maps and generative tools to enhance OV-HOI comprehension by mitigating cross-modal hallucinations and occlusion ambiguity.
This paper proposes a Unified Smart Safety and Security Architecture for AI-driven mining environments, addressing challenges like poor illumination, GPS denial, and cyber-physical threats. The architecture integrates multimodal perception, secure federated learning, reinforcement learning, DTN communication, and energy-aware sensing to improve safety and security. The proposed system comprises five core modules, covering functions such as miner localization, hazard understanding, federated robustness, and predictive maintenance.
Envisions and outlines a comprehensive architecture integrating diverse AI and security techniques to enhance safety and security in autonomous mining environments.
This paper introduces a high dynamic range (HDR) imaging system using a digital micromirror device (DMD) for spatial light modulation to address saturation issues in high-glare environments. The system autonomously segments regions and adaptively controls exposure using a DMD-based optical modulation unit and a computational imaging pipeline. Experimental results demonstrate a 127 dB dynamic range, a 78% reduction in strain error, and improved DIC positioning accuracy, validating the system's effectiveness in extreme lighting conditions.
Introduces a DMD-based adaptive modulation method for HDR imaging that significantly reduces saturation artifacts and improves measurement accuracy in high-glare environments.
The paper introduces Progressive Semantic Illusions, a vector sketching task where a single sketch transforms semantically through sequential stroke additions. They propose Stroke of Surprise, a generative framework using sequence-aware joint optimization with a dual-branch Score Distillation Sampling (SDS) mechanism to satisfy distinct semantic interpretations at different drawing stages. The method dynamically adjusts prefix strokes and uses a novel Overlay Loss to enforce spatial complementarity, achieving superior recognizability and illusion strength compared to baselines.
Introduces a sequence-aware joint optimization framework with a dual-branch SDS mechanism and Overlay Loss to generate vector sketches that progressively transform between distinct semantic interpretations.
The paper introduces STVG-R1, a reinforcement learning framework for spatial-temporal video grounding (STVG) that addresses misalignment between textual descriptions and visual coordinates by reformulating per-frame coordinate prediction as instance-level identification using temporally consistent IDs embedded as visual prompts. This approach avoids the need for additional trainable modules and complex alignment strategies. By employing a task-driven reward to optimize temporal accuracy, spatial consistency, and structural format regularization, STVG-R1 achieves state-of-the-art results on multiple STVG benchmarks and demonstrates strong zero-shot generalization capabilities.
Introduces a novel visual prompting paradigm for spatial-temporal video grounding that reformulates coordinate prediction as instance-level identification and optimizes the process using reinforcement learning.
The paper introduces AssetFormer, an autoregressive Transformer model for generating modular 3D assets from text descriptions, addressing the need for high-quality, diverse assets in the digital industry. AssetFormer models the generation of 3D assets as a sequence of primitives with constrained design parameters, adapting module sequencing and decoding techniques from language models. Experiments using real-world modular assets demonstrate the model's effectiveness in streamlining asset creation for professional development and UGC scenarios.
Introduces an autoregressive Transformer-based architecture, AssetFormer, for generating modular 3D assets from textual descriptions by modeling the asset as a sequence of primitives.
This paper introduces a lightweight RGB-D fusion framework to improve the efficiency and accuracy of Segment Anything Models (SAM). They augment EfficientViT-SAM with monocular depth priors generated by a pretrained estimator, fusing depth information mid-level with RGB features using a dedicated depth encoder. Training on only 11.2k samples, the proposed method outperforms EfficientViT-SAM, demonstrating the effectiveness of depth cues as geometric priors for segmentation.
Introduces a depth-aware fusion mechanism to enhance EfficientViT-SAM, enabling superior segmentation performance with significantly reduced training data.
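A hedged sketch of what such a mid-level fusion block can look like; the channel widths and the additive fusion below are assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class MidLevelDepthFusion(nn.Module):
    # Project depth-encoder features to the RGB channel width and add them;
    # both feature maps are assumed to share spatial resolution.
    def __init__(self, rgb_channels=256, depth_channels=64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(depth_channels, rgb_channels, kernel_size=1),
            nn.BatchNorm2d(rgb_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, rgb_feat, depth_feat):
        return rgb_feat + self.proj(depth_feat)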
The paper identifies limitations in current Vision-Language-Action (VLA) models stemming from inadequate visual representations learned through language-image contrastive learning or image-based self-supervised learning. It proposes JEPA-VLA, a method that integrates video predictive embeddings (specifically V-JEPA 2) into VLAs to improve environment understanding and policy priors. Experiments on benchmarks like LIBERO and real-robot tasks demonstrate that JEPA-VLA significantly improves performance by leveraging the ability of video predictive embeddings to encode task-relevant temporal dynamics.
Introduces JEPA-VLA, a novel approach that adaptively integrates video predictive embeddings into existing VLAs to enhance environment understanding and policy priors.
This paper introduces a semantically conditioned latent diffusion model (LDM) for synthesizing arterial-phase cerebral digital subtraction angiography (DSA) images, addressing the scarcity of DSA data due to its invasive nature. The LDM is conditioned on text embeddings representing anatomical circulation (anterior/posterior) and C-arm positions, enabling explicit control over the synthesis process. Evaluation by medical experts showed high clinical realism with Likert scores of 3.1-3.3 and a low Fréchet inception distance (FID) of 15.27, demonstrating the potential for generating realistic synthetic DSAs for research and training.
Demonstrates semantically controlled synthesis of realistic cerebral DSA images using a latent diffusion model conditioned on anatomical and geometric parameters.
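For reference, the Fréchet inception distance reported above is the standard statistic

\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^{2}
+ \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right),
\]

where \((\mu_r,\Sigma_r)\) and \((\mu_g,\Sigma_g)\) are the Inception-feature means and covariances of real and generated images.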
This paper introduces U-DAVI, an uncertainty-aware amortized variational inference framework for image reconstruction that leverages diffusion priors. By injecting spatially adaptive perturbations to measurements during training, guided by uncertainty estimates, U-DAVI focuses learning on uncertain regions, improving reconstruction quality. Experiments on deblurring and super-resolution tasks demonstrate that U-DAVI achieves competitive or superior performance compared to existing diffusion-based methods, while maintaining computational efficiency.
Introduces an uncertainty-aware training strategy for amortized variational inference with diffusion priors, enabling improved image reconstruction by focusing learning on uncertain regions.
This paper addresses the limited generalization of diffusion-based policies in semantic manipulation by introducing bounding-box instructions to guide the policy's attention to target objects. They developed Label-UMI, a handheld segmentation device with an automated annotation pipeline, to efficiently collect demonstration data with semantic labels. Through real-world experiments, the authors demonstrated improved generalization and adaptability using a semantic-motion-decoupled framework and revealed a power-law relationship between generalization performance and the number of bounding-box objects, achieving 85% success rates across various tasks.
Demonstrates that bounding-box guided diffusion policies, trained on large-scale datasets collected with a novel handheld segmentation device, significantly improve generalization in semantic manipulation tasks and exhibit a power-law scaling relationship.
The paper introduces Transform Domain Fusion UNet (TD-FusionUNet), a lightweight deep learning model for next-day wildfire spread prediction using multimodal satellite data. The model incorporates trainable Hadamard Transform and Discrete Cosine Transform layers to capture frequency components in orthogonalized latent spaces, along with custom preprocessing techniques for sparse pre-fire masks. Evaluated on the Next-Day Wildfire Spread and WildfireSpreadTS datasets, TD-FusionUNet achieves an F1 score of 0.591 with only 370k parameters, surpassing a ResNet18-based UNet baseline in the WildfireSpreadTS dataset.
Introduces a novel U-Net architecture, TD-FusionUNet, that leverages trainable Hadamard and Discrete Cosine Transforms to efficiently capture frequency components in latent spaces for improved wildfire spread prediction.
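A minimal sketch of a trainable, DCT-initialized layer in the spirit described above; the paper's 2-D Hadamard/DCT blocks and their normalization details will differ.

import math
import torch
import torch.nn as nn

class TrainableDCT(nn.Module):
    # A linear layer initialized with the orthonormal 1-D DCT-II basis and
    # then trained end-to-end with the rest of the network.
    def __init__(self, n):
        super().__init__()
        k = torch.arange(n).unsqueeze(1).float()
        i = torch.arange(n).unsqueeze(0).float()
        basis = torch.cos(math.pi * (i + 0.5) * k / n)
        basis[0, :] *= math.sqrt(1.0 / n)
        basis[1:, :] *= math.sqrt(2.0 / n)
        self.proj = nn.Linear(n, n, bias=False)
        with torch.no_grad():
            self.proj.weight.copy_(basis)

    def forward(self, x):
        # x: (..., n) -> frequency-domain coefficients of the same shape.
        return self.proj(x)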
This paper introduces neck-mounted egocentric gaze estimation and presents a new dataset of 4 hours of video from 8 participants performing daily activities. They evaluate a transformer-based gaze estimation model (GLC) and propose two extensions: an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach with a geometry-aware loss. The auxiliary classification task improves performance, while the co-learning approach does not.
Introduces a new task of neck-mounted egocentric gaze estimation and provides a corresponding dataset to facilitate research in this area.
The paper introduces SLD-L2S, a novel lip-to-speech (L2S) framework based on a hierarchical subspace latent diffusion model that directly maps visual lip movements to the continuous latent space of a pre-trained neural audio codec, bypassing intermediate representations. The method employs a hierarchical architecture with parallel subspaces and a diffusion convolution block (DiCB) to enhance interactions within and between subspaces. By using reparameterized flow matching, the framework incorporates speech language model (SLM) and semantic losses during training, leading to state-of-the-art generation quality on benchmark datasets.
Introduces a hierarchical subspace latent diffusion model (SLD-L2S) for lip-to-speech synthesis that directly maps visual lip movements to the continuous latent space of a pre-trained neural audio codec, enabling the incorporation of SLM and semantic losses via reparameterized flow matching.
This paper introduces GR-Diffusion, a novel framework for 3D whole-body PET reconstruction that combines a 3D Gaussian representation (GR) with diffusion models. GR is used to generate a reference 3D PET image from projection data, providing a geometric prior to guide the diffusion process. A hierarchical guidance mechanism refines local details and corrects deviations, enabling the diffusion model to integrate the GR prior and recover sub-voxel information.
Introduces a GR-Diffusion framework that leverages 3D Gaussian representations to guide diffusion models for improved 3D whole-body PET reconstruction, achieving state-of-the-art performance.
This paper introduces LLM-DRS, a novel Large Language Model (LLM)-based framework for disaster reconnaissance summarization in structural health monitoring. The framework integrates vision data and metadata from on-site investigations, using deep convolutional neural networks to extract key attributes like damage state and material type. The extracted data, along with carefully designed prompts, are then fed into an LLM to generate summary reports for individual structures or affected regions.
Introduces a novel LLM-based framework, LLM-DRS, that automates the generation of structural reconnaissance reports by integrating vision data, metadata, and deep learning-extracted attributes.

