Computer Vision
Applications
Image recognition, object detection, segmentation, video understanding, and visual generation.
Recent Papers
This paper investigates the impact of data imbalance on deep learning-based software vulnerability detection using nine open-source datasets and two state-of-the-art DL models. The study confirms that data imbalance significantly affects model performance and that existing imbalance solutions exhibit varying effectiveness across datasets and evaluation metrics. The authors find that focal loss improves precision, mean false error and class-balanced loss improve recall, and random over-sampling improves F1-measure, but no single solution excels across all metrics.
Empirically demonstrates the significant impact of data imbalance on deep learning models for software vulnerability detection and evaluates the effectiveness of existing imbalance solutions across multiple datasets and metrics.
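As a quick reference for one of the remedies compared above, here is a minimal PyTorch sketch of the standard binary focal loss; the gamma and alpha values are illustrative defaults, not the paper's settings.

import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Down-weight easy examples so the minority (vulnerable) class
    # contributes more to the gradient than under plain cross-entropy.
    targets = targets.float()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                                   # prob. of the true class
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()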
This paper introduces CSEval, a framework for evaluating the clinical semantic alignment between text prompts and generated medical images, addressing the limitations of existing metrics focused on realism and diversity. CSEval uses language models to identify semantic inconsistencies related to anatomical location and pathology, demonstrating a correlation with expert clinical judgment. The framework offers a scalable method for assessing the clinical reliability of generated medical images, crucial for the safe deployment of text-to-image models in healthcare.
Introduces CSEval, a novel language model-based framework, to evaluate the clinical semantic alignment between text prompts and generated medical images.
This paper introduces a calibrated Bayesian deep learning framework for medical imaging decision support, addressing the critical need for reliable uncertainty quantification in AI-assisted diagnostics. The framework combines a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) during training, which penalizes high-confidence errors and low-confidence correct predictions, with a post-hoc Dual Temperature Scaling (DTS) strategy to refine the posterior distribution. Validated on pneumonia screening, diabetic retinopathy detection, and skin lesion identification, the approach demonstrates improved calibration, robust performance in data-scarce scenarios, and effectiveness on imbalanced datasets.
Introduces a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) and Dual Temperature Scaling (DTS) strategy to improve calibration and uncertainty quantification in Bayesian deep learning models for medical imaging.
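For background, a minimal sketch of the single-temperature scaling that DTS builds on, fitted on held-out validation logits; DTS itself uses two temperatures and is not reproduced here.

import torch
from torch import nn, optim

def fit_temperature(val_logits, val_labels):
    # Classic post-hoc temperature scaling: learn one scalar T on a
    # validation split so that softmax(logits / T) is better calibrated.
    T = nn.Parameter(torch.ones(1))
    opt = optim.LBFGS([T], lr=0.01, max_iter=100)
    nll = nn.CrossEntropyLoss()

    def closure():
        opt.zero_grad()
        loss = nll(val_logits / T.clamp(min=1e-3), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return T.item()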
The paper introduces KAN-FIF, a lightweight neural network architecture leveraging Kolmogorov-Arnold Networks (KANs) with spline parameterization to estimate tropical cyclone intensity from meteorological satellite data. KAN-FIF addresses the limitations of existing physics-guided models, which suffer from high parameter counts and computational inefficiency due to their inability to capture complex feature interactions. Experiments demonstrate that KAN-FIF achieves superior accuracy with significantly reduced parameters and faster inference speed compared to baseline models like Phy-CoCo, making it suitable for deployment on resource-constrained edge devices.
Introduces KAN-FIF, a novel and lightweight neural network architecture for tropical cyclone intensity estimation that integrates spline-parameterized KAN layers to efficiently capture complex feature interactions.
The paper introduces DynaHOI-Gym, a new online closed-loop platform for benchmarking hand motion generation in dynamic hand-object interaction (HOI) scenarios, addressing the limitations of existing benchmarks focused on static objects. To facilitate research, the authors release DynaHOI-10M, a large-scale dataset comprising 10 million frames and 180K hand capture trajectories with diverse target motions. They also present an observe-before-act (ObAct) baseline that leverages spatiotemporal attention, demonstrating improved location success rates in the dynamic HOI setting.
Introduces DynaHOI-Gym and DynaHOI-10M, a novel benchmark and dataset for evaluating hand motion generation in dynamic hand-object interaction scenarios.
This paper introduces PuYun-LDM, a latent diffusion model for high-resolution ensemble weather forecasting that addresses the limited diffusability of LDMs in this domain. To improve diffusability, the authors incorporate weather-state evolution features encoded by a 3D Masked AutoEncoder (3D-MAE) as additional conditioning. They also propose a Variable-Aware Masked Frequency Modeling (VA-MFM) strategy to adaptively regularize the spectral energy distribution of each variable, leading to improved performance compared to ENS at short lead times.
Introduces a novel latent diffusion model, PuYun-LDM, incorporating 3D-MAE conditioning and Variable-Aware Masked Frequency Modeling to enhance diffusability and improve high-resolution ensemble weather forecasting.
The paper introduces PosterOmni, a framework for generalized artistic poster creation that tackles both local image editing and global design creation aspects of the task. It achieves this by constructing a multi-task dataset, distilling knowledge from local and global expert models, and applying a unified reward feedback mechanism to align visual fidelity and aesthetic preferences. Experiments on the new PosterOmni-Bench demonstrate that PosterOmni outperforms existing open-source and proprietary systems in reference adherence, composition, and aesthetics.
Introduces a novel data-distillation-reward pipeline to unify local image editing and global design creation for generalized artistic poster generation.
The paper introduces TexSpot, a diffusion-based texture enhancement framework that addresses view-inconsistency and resolution limitations in 3D texture generation. TexSpot utilizes a novel 3D texture representation called Texlet, which combines point-based and UV-based approaches by encoding local texture patches with a 2D encoder and aggregating them with a 3D encoder. Experiments show that TexSpot significantly improves visual fidelity, geometric consistency, and robustness compared to existing state-of-the-art methods.
Introduces Texlet, a novel 3D texture representation that merges the geometric expressiveness of point-based 3D textures with the compactness of UV-based representations.
The paper introduces OMEGA-Avatar, a novel feed-forward framework for generating animatable, 360-degree complete 3D Gaussian head avatars from a single image. To achieve this, the method incorporates a semantic-aware mesh deformation module for improved hair modeling and a multi-view feature splatting module to construct a shared canonical UV representation. Experiments demonstrate that OMEGA-Avatar outperforms existing methods in 360-degree full-head completeness and identity preservation.
Introduces a feed-forward framework, OMEGA-Avatar, that generates generalizable, 360-degree complete, and animatable 3D Gaussian head avatars from a single image by combining semantic-aware mesh deformation and multi-view feature splatting.
The paper introduces a supervise-assisted multi-modality fusion diffusion model (MFdiff) to restore standard-dose PET (SPET) images from low-dose PET (LPET) and MR images. MFdiff uses a multi-modality feature fusion module to learn optimized fusion features from MR images and incorporates these features as additional conditions in a diffusion model for iterative SPET image generation. A two-stage supervise-assisted learning strategy leverages both generalized priors from simulated data and specific priors from in-vivo data to improve restoration quality, demonstrating superior performance compared to existing methods.
Introduces a novel supervise-assisted multi-modality fusion diffusion model (MFdiff) that effectively leverages MR images to restore high-quality SPET images from LPET data by using a two-stage training approach.
This paper introduces a novel steganalysis method for H.265/HEVC video that focuses on the coding unit (CU) block structure, addressing the limitations of existing methods that primarily analyze motion vectors, intra prediction modes, or transform coefficients. The method constructs a CU block-structure gradient map to capture changes in coding-unit partitioning and combines it with a block-level mapping representation of intra prediction modes to model steganographic perturbations. A tailored Transformer network, GradIPMFormer, is designed to enhance the perception of CU-level steganographic behaviors, demonstrating superior detection performance across multiple H.265/HEVC steganographic algorithms.
Introduces a CU block-level steganalysis method for H.265/HEVC video by constructing a CU block-structure gradient map and combining it with a block-level mapping representation of intra prediction modes, then training a custom Transformer network.
The paper introduces Spatial Chain-of-Thought (SCoT), a framework that combines the spatial reasoning of Multimodal Large Language Models (MLLMs) with the generative capabilities of diffusion models for improved image generation. SCoT trains a diffusion model on interleaved text-coordinate instructions to enhance layout awareness and uses MLLMs as planners to generate detailed layout plans. Experiments show SCoT achieves state-of-the-art performance on image generation benchmarks and excels in complex reasoning and image editing tasks.
Introduces Spatial Chain-of-Thought (SCoT), a novel plug-and-play framework that bridges MLLM reasoning and diffusion model generation by training the diffusion model with interleaved text-coordinate instructions and using MLLMs for spatial planning.
This paper addresses the instability issues in Rectified Flow (RF) inversion, which arise from accumulated approximation errors during the inversion process. The authors introduce Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it towards a running average of past velocities within a theoretically motivated spherical Gaussian constraint. They further propose mimic-CFG, a velocity correction scheme for editing tasks that interpolates between the current velocity and its projection onto the historical average.
Introduces Proximal-Mean Inversion (PMI) and mimic-CFG, two novel, training-free methods to stabilize Rectified Flow inversion and improve image reconstruction and editing fidelity.
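A toy sketch of the underlying intuition only: pulling the current velocity toward the running mean of past velocities. PMI's actual correction is derived from a proximal objective under a spherical Gaussian constraint; the fixed beta weight below is invented for illustration.

import torch

def blended_velocity(v_t, past_velocities, beta=0.5):
    # Illustrative only: blend the current velocity with the running mean
    # of past velocities accumulated along the inversion trajectory.
    if not past_velocities:
        return v_t
    v_bar = torch.stack(past_velocities).mean(dim=0)
    return (1.0 - beta) * v_t + beta * v_bar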
The paper investigates the use of 3D fractals generated via Iterated Function Systems (IFS) as a synthetic pre-training dataset for action recognition models. It identifies limitations in standard fractal generation methods, including slow speed and degenerate fractal structures, and finds that overly restrictive filtering hurts downstream performance. The authors introduce Targeted Smart Filtering, a novel method that significantly accelerates fractal generation (100x speedup) while maintaining fractal diversity, leading to improved action recognition performance after pre-training.
Introduces Targeted Smart Filtering, a novel method for generating high-quality 3D fractals for action recognition pre-training that balances generation speed and fractal diversity.
The paper introduces DreamID-Omni, a unified framework for human-centric audio-video generation, addressing tasks like reference-based generation, video editing, and audio-driven animation within a single model. It tackles the challenge of disentangling character identities and voice timbres by employing a Dual-Level Disentanglement strategy and a Symmetric Conditional Diffusion Transformer. Experimental results demonstrate state-of-the-art performance in video, audio, and audio-visual consistency, surpassing even proprietary commercial models.
Introduces a unified framework, DreamID-Omni, that achieves state-of-the-art performance on a range of human-centric audio-video generation tasks by disentangling identity and timbre control.
This paper investigates the use of local vision-language models (VLMs) to improve fine-grained activity recognition in newborn resuscitation videos, comparing them to a TimeSformer baseline. The authors explored zero-shot VLM strategies and fine-tuned VLMs with LoRA on a simulated dataset of 13.26 hours of video. Fine-tuning a local VLM with LoRA achieved an F1 score of 0.91, outperforming the TimeSformer baseline (0.70), suggesting the potential of VLMs for this task.
Demonstrates that fine-tuning local vision-language models with LoRA can significantly improve activity recognition in newborn resuscitation videos compared to a TimeSformer baseline.
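For readers unfamiliar with the setup, a hedged sketch of LoRA fine-tuning of a local VLM via the peft library; the checkpoint name, rank, and target modules below are assumptions, not the paper's exact configuration.

from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Illustrative LoRA setup: only low-rank adapters on selected projection
# layers are trained, keeping the base VLM weights frozen.
model = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()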
This paper introduces the Task-Amortized Variational Autoencoder (TAVAE), a generative model of V1 activity, to investigate how task-specific priors are learned and deployed in the visual cortex. TAVAE extends the VAE framework to efficiently acquire new tasks by reusing previously learned representations, allowing for flexible adaptation of priors. By comparing TAVAE's posterior distributions with large-scale V1 recordings from mice performing a discrimination task, the study demonstrates that the visual system can rapidly learn and utilize task-specific contextual priors, reflected in bimodal response profiles when task statistics are violated.
Introduces the Task-Amortized Variational Autoencoder (TAVAE), a novel VAE architecture that enables efficient learning of task-specific priors by amortizing learning across tasks.
This paper presents an anatomical analysis of text prompting within vision-language segmentation models, specifically SAM3, revealing significant redundancy in text encoder utilization. Based on these findings, they propose SAM3-LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student. Experiments demonstrate that SAM3-LiteText reduces text encoder parameters by up to 88% while maintaining segmentation performance on image and video segmentation benchmarks.
Introduces SAM3-LiteText, a distilled MobileCLIP-based text encoder, to significantly reduce the computational and memory overhead of SAM3's text encoder without sacrificing segmentation accuracy.
This paper introduces HyperDet, a radar-only 3D object detection framework that enhances raw radar data to be more compatible with LiDAR-oriented detectors. HyperDet aggregates multi-frame, multi-radar data, applies geometry-aware cross-sensor validation, and uses a foreground-focused diffusion module trained with mixed radar-LiDAR supervision to densify object structures and lift radar attributes. Experiments on the MAN TruckScenes dataset demonstrate that HyperDet improves performance with VoxelNeXt and CenterPoint, reducing the gap between radar-only and LiDAR-based detection.
Proposes HyperDet, a novel radar-only 3D detection framework that constructs a task-aware hyper 4D radar point cloud to improve performance with standard LiDAR-oriented detectors.
The paper introduces PLESS, a pseudo-label enhancement strategy for weakly supervised segmentation using scribble annotations, addressing the limitations of noisy and incomplete supervision. PLESS leverages a hierarchical partitioning of the image into spatially coherent regions to propagate scribble information and refine pseudo-labels within these regions. Experiments on cardiac MRI datasets demonstrate that PLESS consistently improves segmentation accuracy across different scribble-supervised algorithms.
Introduces a novel pseudo-label enhancement strategy, PLESS, that leverages hierarchical image partitioning to improve the reliability and spatial consistency of pseudo-labels in weakly supervised segmentation.
This paper introduces a decentralized multi-robot system for detecting and tracking floating containers in maritime environments, using a team of UAVs and an autonomous surface vessel. The system employs YOLOv8 and stereo disparity for visual detection on each UAV, followed by per-object Extended Kalman Filters (EKFs) for tracking with uncertainty-aware data association. Track summaries are exchanged and fused using covariance intersection to maintain consistency, and an information-driven assignment module optimizes target allocation and UAV viewpoints.
Introduces a decentralized multi-robot perception framework that combines visual detection, EKF tracking with uncertainty-aware data association, conservative track fusion via covariance intersection, and information-driven task assignment for robust maritime object tracking.
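For reference, a minimal NumPy sketch of standard covariance intersection, the conservative fusion rule mentioned above; the trace-minimizing weight is found by a simple grid search here.

import numpy as np

def covariance_intersection(xa, Pa, xb, Pb, n_grid=99):
    # Fuse two track estimates with unknown cross-correlation: pick the
    # mixing weight w that minimizes the trace of the fused covariance.
    Pa_inv, Pb_inv = np.linalg.inv(Pa), np.linalg.inv(Pb)
    best_x, best_P = None, None
    for w in np.linspace(0.01, 0.99, n_grid):
        P = np.linalg.inv(w * Pa_inv + (1.0 - w) * Pb_inv)
        if best_P is None or np.trace(P) < np.trace(best_P):
            best_x = P @ (w * Pa_inv @ xa + (1.0 - w) * Pb_inv @ xb)
            best_P = P
    return best_x, best_P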
This paper introduces a deep learning approach to enhance social robot gaze behavior by incorporating both human and non-human stimuli, using LSTM and Transformer models trained on human gaze data collected via VR in simulated and real-world scenarios. The models predict human gaze direction with accuracies of up to 72% and 71.6% for the LSTM and Transformer, respectively, in real-world settings, outperforming existing methods by uniquely considering non-human stimuli. The system was deployed on a NAO robot and evaluated with 275 participants, demonstrating high user satisfaction.
Demonstrates a novel approach to predicting human gaze in social settings by integrating non-human stimuli and achieving state-of-the-art accuracy using LSTM and Transformer models.
This paper introduces ViTaS, a visuomotor learning framework that leverages both visual and tactile information through Soft Fusion Contrastive Learning and a CVAE module to improve performance in manipulation tasks, especially in occluded scenarios. The Soft Fusion Contrastive Learning method is designed to better exploit the alignment and complementarity of visual and tactile representations. Experiments across 12 simulated and 3 real-world environments demonstrate that ViTaS significantly outperforms existing baselines, highlighting the benefits of the proposed fusion and contrastive learning approach.
Introduces Soft Fusion Contrastive Learning to effectively fuse visual and tactile information for visuomotor tasks, improving performance in occluded scenarios by explicitly modeling the complementary nature of the two modalities.
The paper introduces DeepGen 1.0, a 5B parameter unified multimodal model for image generation and editing, designed to be lightweight and efficient compared to larger models. To enhance semantic understanding in the compact model, they propose Stacked Channel Bridging (SCB) to extract and fuse hierarchical features from VLMs with learnable 'think tokens'. They also employ a three-stage data-centric training strategy, including alignment pre-training, joint supervised fine-tuning, and reinforcement learning with MR-GRPO, achieving state-of-the-art performance on benchmarks like WISE and UniREditBench while using only 50M training samples.
Introduces Stacked Channel Bridging (SCB), a novel deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to improve the generative backbone's semantic understanding and fine-grained control.
The paper introduces A$^{2}$V-SLP, an alignment-aware variational framework for sign language production that learns disentangled latent distributions for each articulator. This approach uses a disentangled VAE to encode sign pose sequences and extract articulator-specific mean and variance vectors, which then serve as distributional supervision for a non-autoregressive Transformer that predicts latent means and log-variances from text embeddings. By employing stochastic sampling and a gloss attention mechanism, A$^{2}$V-SLP achieves state-of-the-art back-translation performance and enhances motion realism in gloss-free sign language production.
Introduces an alignment-aware variational framework (A$^{2}$V-SLP) that learns disentangled latent distributions for sign language production, improving back-translation performance and motion realism.
The paper introduces LUVE, a latent-cascaded framework for ultra-high-resolution (UHR) video generation that tackles challenges in motion modeling, semantic planning, and detail synthesis. LUVE uses a three-stage architecture: low-resolution motion generation, latent upsampling, and high-resolution content refinement with dual frequency experts. Experiments demonstrate that LUVE achieves superior photorealism and content fidelity in UHR video generation compared to existing methods.
Introduces a novel latent-cascaded architecture with dual-frequency experts for generating ultra-high-resolution videos, improving both photorealism and content fidelity.
The paper addresses object hallucination in Multimodal Large Language Models (MLLMs) by improving visual contrastive decoding (VCD) through the creation of an object-aligned auxiliary view. This auxiliary view is constructed by masking the most salient visual evidence based on object-centric attention from self-supervised Vision Transformers, thereby disrupting unsupported tokens during decoding. The proposed method, "Mask What Matters," is prompt-agnostic, model-agnostic, and computationally efficient, leading to improved performance on object hallucination benchmarks.
Introduces a novel object-aligned visual contrastive decoding method that masks salient visual features to mitigate object hallucinations in MLLMs.
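A minimal sketch of the generic visual-contrastive-decoding step that the method instantiates with its object-aligned masked view; alpha is an illustrative contrast strength, not a value from the paper.

import torch

def contrastive_decode_step(logits_full, logits_masked, alpha=1.0):
    # Tokens that remain likely even after the salient visual evidence is
    # masked are treated as unsupported and down-weighted.
    adjusted = (1.0 + alpha) * logits_full - alpha * logits_masked
    return torch.argmax(adjusted, dim=-1)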
This paper investigates gender and skin-tone biases in Gemini Flash 2.5 Image and GPT Image 1.5 by generating 3,200 images from semantically neutral prompts. Using a pipeline involving color normalization, facial landmark masking, and skin tone quantification via Monk, PERLA, and Fitzpatrick scales, the study reveals a "default white" bias in both models. Gemini favored female-presenting subjects, while GPT favored male-presenting subjects with lighter skin tones, demonstrating that neutral prompts elicit polarized demographic defaults.
Quantifies and compares gender and skin-tone biases in Gemini Flash 2.5 Image and GPT Image 1.5 using a rigorous colorimetric methodology.
The paper introduces PathCRF, a novel framework for detecting on-ball soccer events using only player tracking data by inferring possession paths. They model player trajectories as a fully connected dynamic graph and use a Conditional Random Field (CRF) to ensure logical consistency in the inferred possession sequence. Experiments demonstrate that PathCRF accurately detects possession paths and events, reducing the need for manual annotation.
Introduces a ball-free soccer event detection framework, PathCRF, that infers possession paths from player trajectories using a CRF to enforce logical consistency.
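As an illustration of the decoding step, a standard Viterbi pass over a linear chain of per-frame possession scores; the score definitions are generic placeholders, not PathCRF's learned potentials.

import numpy as np

def possession_viterbi(emissions, transitions):
    # emissions[t, p]: score of player p holding the ball at frame t.
    # transitions[p, q]: score of the ball moving from player p to player q.
    T, P = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, P), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions          # shape (P, P)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]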
This paper introduces a universal diffusion-based downscaling framework that converts low-resolution weather forecasts into high-resolution probabilistic predictions without model-specific fine-tuning. A conditional diffusion model is trained on coarse-resolution inputs and high-resolution reanalysis targets and then applied in a zero-shot manner to deterministic forecasts from various weather models. The downscaled forecasts consistently improve upon the raw deterministic forecasts, with significant gains in probabilistic skill (CRPS) when evaluated against independent station observations.
Demonstrates a scalable, model-agnostic probabilistic interface for enhancing spatial resolution and uncertainty representation in operational weather forecasting pipelines via diffusion-based downscaling.
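For reference, the empirical ensemble CRPS used to measure probabilistic skill can be computed per grid point as below; this is the standard estimator, independent of the paper's implementation.

import numpy as np

def ensemble_crps(members, obs):
    # Ensemble error against the observation minus half the mean pairwise
    # spread among ensemble members.
    m = np.asarray(members, dtype=float)
    return np.mean(np.abs(m - obs)) - 0.5 * np.mean(np.abs(m[:, None] - m[None, :]))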
This paper explores test-time verification as a method to improve vision-language-action (VLA) alignment, addressing the "intention-action gap" in embodied instruction following. They demonstrate that scaling both rephrased instructions and generated actions at test time enhances sample diversity and improves action selection. The authors introduce CoVer, a contrastive verifier, and a hierarchical verification inference pipeline, showing that this verification approach outperforms scaling policy pre-training on the SIMPLER and PolaRiS benchmarks.
Demonstrates that scaling test-time verification, through diverse instruction rephrasing and action candidate generation, is more effective than scaling policy pre-training for vision-language-action alignment.
This paper introduces an energy-aware spike budgeting framework for continual learning in spiking neural networks (SNNs) to address catastrophic forgetting while optimizing for energy efficiency. The framework combines experience replay, learnable LIF neuron parameters, and an adaptive spike scheduler to enforce dataset-specific energy constraints during training. Results show that spike budgeting acts as a sparsity-inducing regularizer on frame-based datasets, improving accuracy and reducing spike rates, while controlled budget relaxation enables accuracy gains on event-based datasets.
Introduces an energy-aware spike budgeting framework that adaptively controls spike rates during continual learning in SNNs to improve both accuracy and energy efficiency across frame-based and event-based neuromorphic vision datasets.
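A minimal sketch of what a spike-budget regularizer can look like; the paper's adaptive scheduler adjusts the budget during training, and the penalty weight below is invented.

import torch

def spike_budget_penalty(spike_counts, budget, lam=1.0):
    # Penalize only the average spike rate in excess of a dataset-specific
    # budget, acting as a sparsity-inducing term on network activity.
    rate = spike_counts.float().mean()
    return lam * torch.relu(rate - budget)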
This paper introduces Electrostatics-Inspired Surface Reconstruction (EISR), a novel method for 3D surface reconstruction that represents shapes as solutions to Poisson's equation. By drawing an analogy to electrostatics and utilizing Green's functions, the method derives a closed-form parametric expression for the implicit field. The key result is improved reconstruction of high-frequency details compared to existing SDF-based methods, even with limited shape priors, by leveraging the superposition principle of Poisson's equation solutions.
Formulates 3D surface reconstruction as solving Poisson's equation using Green's functions and superposition, enabling improved high-frequency detail recovery.
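For context, the standard electrostatics identities the analogy rests on: the free-space Green's function of the 3D Laplacian and the superposition of point-source potentials. The source strengths q_i and locations x_i are generic symbols, not the paper's parameterization.

\[
\nabla^{2}\phi(\mathbf{x}) = -\rho(\mathbf{x}), \qquad
G(\mathbf{x},\mathbf{x}') = \frac{1}{4\pi\,\lVert \mathbf{x}-\mathbf{x}' \rVert}, \qquad
\phi(\mathbf{x}) = \sum_{i} q_{i}\, G(\mathbf{x},\mathbf{x}_{i}).
\]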
The paper introduces EmoSpace, a framework for emotion-aware content generation that learns dynamic emotion prototypes via vision-language alignment to enable fine-grained emotional control in VR content creation. EmoSpace uses a hierarchical emotion representation with learnable prototypes that evolve during training, allowing for control without explicit emotion labels. Experiments demonstrate EmoSpace's superior performance in emotional image outpainting, stylized generation, and emotional panorama generation, further validated by a user study comparing emotional perception in VR versus desktop environments.
Introduces a novel emotion-aware content generation framework, EmoSpace, that learns dynamic, interpretable emotion prototypes through vision-language alignment.
This paper introduces ImagineAgent, a framework that uses cognitive reasoning and generative imagination to improve Open-Vocabulary Human-Object Interaction (OV-HOI) comprehension. ImagineAgent constructs cognitive maps to model relationships between entities and actions, and uses retrieval augmentation, image cropping, and diffusion models to gather knowledge and visual evidence. Experiments on SWIG-HOI and HICO-DET show state-of-the-art performance with significantly less training data.
Introduces ImagineAgent, a novel agentic framework that leverages cognitive maps and generative tools to enhance OV-HOI comprehension by mitigating cross-modal hallucinations and occlusion ambiguity.
This paper proposes a Unified Smart Safety and Security Architecture for AI-driven mining environments, addressing challenges like poor illumination, GPS denial, and cyber-physical threats. The architecture integrates multimodal perception, secure federated learning, reinforcement learning, DTN communication, and energy-aware sensing to improve safety and security. The proposed system comprises five core modules, covering functions such as miner localization, hazard understanding, federated robustness, and predictive maintenance.
Envisions and outlines a comprehensive architecture integrating diverse AI and security techniques to enhance safety and security in autonomous mining environments.
This paper introduces a high dynamic range (HDR) imaging system using a digital micromirror device (DMD) for spatial light modulation to address saturation issues in high-glare environments. The system autonomously segments regions and adaptively controls exposure using a DMD-based optical modulation unit and a computational imaging pipeline. Experimental results demonstrate a 127 dB dynamic range, a 78% reduction in strain error, and improved DIC positioning accuracy, validating the system's effectiveness in extreme lighting conditions.
Introduces a DMD-based adaptive modulation method for HDR imaging that significantly reduces saturation artifacts and improves measurement accuracy in high-glare environments.
The paper introduces Progressive Semantic Illusions, a vector sketching task where a single sketch transforms semantically through sequential stroke additions. They propose Stroke of Surprise, a generative framework using sequence-aware joint optimization with a dual-branch Score Distillation Sampling (SDS) mechanism to satisfy distinct semantic interpretations at different drawing stages. The method dynamically adjusts prefix strokes and uses a novel Overlay Loss to enforce spatial complementarity, achieving superior recognizability and illusion strength compared to baselines.
Introduces a sequence-aware joint optimization framework with a dual-branch SDS mechanism and Overlay Loss to generate vector sketches that progressively transform between distinct semantic interpretations.
The paper introduces STVG-R1, a reinforcement learning framework for spatial-temporal video grounding (STVG) that addresses misalignment between textual descriptions and visual coordinates by reformulating per-frame coordinate prediction as instance-level identification using temporally consistent IDs embedded as visual prompts. This approach avoids the need for additional trainable modules and complex alignment strategies. By employing a task-driven reward to optimize temporal accuracy, spatial consistency, and structural format regularization, STVG-R1 achieves state-of-the-art results on multiple STVG benchmarks and demonstrates strong zero-shot generalization capabilities.
Introduces a novel visual prompting paradigm for spatial-temporal video grounding that reformulates coordinate prediction as instance-level identification and optimizes the process using reinforcement learning.
The paper introduces AssetFormer, an autoregressive Transformer model for generating modular 3D assets from text descriptions, addressing the need for high-quality, diverse assets in the digital industry. AssetFormer models the generation of 3D assets as a sequence of primitives with constrained design parameters, adapting module sequencing and decoding techniques from language models. Experiments using real-world modular assets demonstrate the model's effectiveness in streamlining asset creation for professional development and UGC scenarios.
Introduces an autoregressive Transformer-based architecture, AssetFormer, for generating modular 3D assets from textual descriptions by modeling the asset as a sequence of primitives.
This paper introduces a lightweight RGB-D fusion framework to improve the efficiency and accuracy of Segment Anything Models (SAM). They augment EfficientViT-SAM with monocular depth priors generated by a pretrained estimator, fusing depth information mid-level with RGB features using a dedicated depth encoder. Training on only 11.2k samples, the proposed method outperforms EfficientViT-SAM, demonstrating the effectiveness of depth cues as geometric priors for segmentation.
Introduces a depth-aware fusion mechanism to enhance EfficientViT-SAM, enabling superior segmentation performance with significantly reduced training data.
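A hedged sketch of what such a mid-level fusion block can look like; the channel widths and the additive fusion below are assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class MidLevelDepthFusion(nn.Module):
    # Project depth-encoder features to the RGB channel width and add them;
    # both feature maps are assumed to share spatial resolution.
    def __init__(self, rgb_channels=256, depth_channels=64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(depth_channels, rgb_channels, kernel_size=1),
            nn.BatchNorm2d(rgb_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, rgb_feat, depth_feat):
        return rgb_feat + self.proj(depth_feat)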
The paper identifies limitations in current Vision-Language-Action (VLA) models stemming from inadequate visual representations learned through language-image contrastive learning or image-based self-supervised learning. It proposes JEPA-VLA, a method that integrates video predictive embeddings (specifically V-JEPA 2) into VLAs to improve environment understanding and policy priors. Experiments on benchmarks like LIBERO and real-robot tasks demonstrate that JEPA-VLA significantly improves performance by leveraging the ability of video predictive embeddings to encode task-relevant temporal dynamics.
Introduces JEPA-VLA, a novel approach that adaptively integrates video predictive embeddings into existing VLAs to enhance environment understanding and policy priors.
This paper introduces a semantically conditioned latent diffusion model (LDM) for synthesizing arterial-phase cerebral digital subtraction angiography (DSA) images, addressing the scarcity of DSA data due to its invasive nature. The LDM is conditioned on text embeddings representing anatomical circulation (anterior/posterior) and C-arm positions, enabling explicit control over the synthesis process. Evaluation by medical experts showed high clinical realism with Likert scores of 3.1-3.3 and a low Fréchet inception distance (FID) of 15.27, demonstrating the potential for generating realistic synthetic DSAs for research and training.
Demonstrates semantically controlled synthesis of realistic cerebral DSA images using a latent diffusion model conditioned on anatomical and geometric parameters.
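For reference, the Fréchet inception distance reported above is the standard statistic

\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^{2}
+ \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right),
\]

where \((\mu_r,\Sigma_r)\) and \((\mu_g,\Sigma_g)\) are the Inception-feature means and covariances of real and generated images.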
This paper introduces U-DAVI, an uncertainty-aware amortized variational inference framework for image reconstruction that leverages diffusion priors. By injecting spatially adaptive perturbations to measurements during training, guided by uncertainty estimates, U-DAVI focuses learning on uncertain regions, improving reconstruction quality. Experiments on deblurring and super-resolution tasks demonstrate that U-DAVI achieves competitive or superior performance compared to existing diffusion-based methods, while maintaining computational efficiency.
Introduces an uncertainty-aware training strategy for amortized variational inference with diffusion priors, enabling improved image reconstruction by focusing learning on uncertain regions.
This paper addresses the limited generalization of diffusion-based policies in semantic manipulation by introducing bounding-box instructions to guide the policy's attention to target objects. They developed Label-UMI, a handheld segmentation device with an automated annotation pipeline, to efficiently collect demonstration data with semantic labels. Through real-world experiments, the authors demonstrated improved generalization and adaptability using a semantic-motion-decoupled framework and revealed a power-law relationship between generalization performance and the number of bounding-box objects, achieving 85% success rates across various tasks.
Demonstrates that bounding-box guided diffusion policies, trained on large-scale datasets collected with a novel handheld segmentation device, significantly improve generalization in semantic manipulation tasks and exhibit a power-law scaling relationship.
The paper introduces Transform Domain Fusion UNet (TD-FusionUNet), a lightweight deep learning model for next-day wildfire spread prediction using multimodal satellite data. The model incorporates trainable Hadamard Transform and Discrete Cosine Transform layers to capture frequency components in orthogonalized latent spaces, along with custom preprocessing techniques for sparse pre-fire masks. Evaluated on the Next-Day Wildfire Spread and WildfireSpreadTS datasets, TD-FusionUNet achieves an F1 score of 0.591 with only 370k parameters, surpassing a ResNet18-based UNet baseline in the WildfireSpreadTS dataset.
Introduces a novel U-Net architecture, TD-FusionUNet, that leverages trainable Hadamard and Discrete Cosine Transforms to efficiently capture frequency components in latent spaces for improved wildfire spread prediction.
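A minimal sketch of a trainable, DCT-initialized layer in the spirit described above; the paper's 2-D Hadamard/DCT blocks and their normalization details will differ.

import math
import torch
import torch.nn as nn

class TrainableDCT(nn.Module):
    # A linear layer initialized with the orthonormal 1-D DCT-II basis and
    # then trained end-to-end with the rest of the network.
    def __init__(self, n):
        super().__init__()
        k = torch.arange(n).unsqueeze(1).float()
        i = torch.arange(n).unsqueeze(0).float()
        basis = torch.cos(math.pi * (i + 0.5) * k / n)
        basis[0, :] *= math.sqrt(1.0 / n)
        basis[1:, :] *= math.sqrt(2.0 / n)
        self.proj = nn.Linear(n, n, bias=False)
        with torch.no_grad():
            self.proj.weight.copy_(basis)

    def forward(self, x):
        # x: (..., n) -> frequency-domain coefficients of the same shape.
        return self.proj(x)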
This paper introduces neck-mounted egocentric gaze estimation and presents a new dataset of 4 hours of video from 8 participants performing daily activities. They evaluate a transformer-based gaze estimation model (GLC) and propose two extensions: an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach with a geometry-aware loss. The auxiliary classification task improves performance, while the co-learning approach does not.
Introduces a new task of neck-mounted egocentric gaze estimation and provides a corresponding dataset to facilitate research in this area.
The paper introduces SLD-L2S, a novel lip-to-speech (L2S) framework based on a hierarchical subspace latent diffusion model that directly maps visual lip movements to the continuous latent space of a pre-trained neural audio codec, bypassing intermediate representations. The method employs a hierarchical architecture with parallel subspaces and a diffusion convolution block (DiCB) to enhance interactions within and between subspaces. By using reparameterized flow matching, the framework incorporates speech language model (SLM) and semantic losses during training, leading to state-of-the-art generation quality on benchmark datasets.
Introduces a hierarchical subspace latent diffusion model (SLD-L2S) for lip-to-speech synthesis that directly maps visual lip movements to the continuous latent space of a pre-trained neural audio codec, enabling the incorporation of SLM and semantic losses via reparameterized flow matching.
This paper introduces GR-Diffusion, a novel framework for 3D whole-body PET reconstruction that combines a 3D Gaussian representation (GR) with diffusion models. GR is used to generate a reference 3D PET image from projection data, providing a geometric prior to guide the diffusion process. A hierarchical guidance mechanism refines local details and corrects deviations, enabling the diffusion model to integrate the GR prior and recover sub-voxel information.
Introduces a GR-Diffusion framework that leverages 3D Gaussian representations to guide diffusion models for improved 3D whole-body PET reconstruction, achieving state-of-the-art performance.
This paper introduces LLM-DRS, a novel Large Language Model (LLM)-based framework for disaster reconnaissance summarization in structural health monitoring. The framework integrates vision data and metadata from on-site investigations, using deep convolutional neural networks to extract key attributes like damage state and material type. The extracted data, along with carefully designed prompts, are then fed into an LLM to generate summary reports for individual structures or affected regions.
Introduces a novel LLM-based framework, LLM-DRS, that automates the generation of structural reconnaissance reports by integrating vision data, metadata, and deep learning-extracted attributes.

