Multimodal Models
Models that process and generate across multiple modalities: vision-language, audio-text, and unified multimodal architectures.
Recent Papers
This paper introduces a hybrid Mamba-Transformer (MT) framework for remote sensing image super-resolution, aiming to overcome the limitations of CNNs and transformers in capturing long-range dependencies and maintaining computational efficiency. MT pairs a focused Mamba block (FMB), which uses a snake vision state-space module (SVSSM) for global feature modeling, with a pixel-adaptive block (PAB) for pixel-level multiscale enhancement. Experiments on benchmark datasets demonstrate that MT outperforms state-of-the-art methods, achieving a better trade-off between performance and computational cost, specifically reducing parameters and FLOPs compared to MambaIRv2 while improving PSNR.
Introduces a novel hybrid Mamba-Transformer architecture that leverages a snake vision state-space module within a Mamba block to improve long-range dependency modeling and reduce computational redundancy for remote sensing image super-resolution.
The paper introduces the Visual Reasoning Benchmark (VRB), a new dataset of 701 visual reasoning questions sourced from primary school exams in Zambia and India, designed to evaluate multimodal large language models (MLLMs). The VRB focuses on minimal-text images to simulate realistic classroom visual reasoning problems, covering tasks like analogy, pattern completion, and spatial matching. Experiments using the VRB reveal that MLLMs exhibit a "jagged frontier" of capabilities, performing well on static tasks like counting but struggling with dynamic spatial operations like folding and rotation.
Introduces the Visual Reasoning Benchmark (VRB), a novel dataset of classroom-authentic visual reasoning problems, to evaluate the spatial reasoning capabilities of MLLMs.
This paper introduces CSEval, a framework for evaluating the clinical semantic alignment between text prompts and generated medical images, addressing the limitations of existing metrics focused on realism and diversity. CSEval uses language models to identify semantic inconsistencies related to anatomical location and pathology, demonstrating a correlation with expert clinical judgment. The framework offers a scalable method for assessing the clinical reliability of generated medical images, crucial for the safe deployment of text-to-image models in healthcare.
Introduces CSEval, a novel language model-based framework, to evaluate the clinical semantic alignment between text prompts and generated medical images.
The paper introduces DeepSight, an open-source toolkit designed to integrate safety evaluation and diagnosis for large language models (LLMs) and multimodal large language models (MLLMs). DeepSight combines DeepSafe, an evaluation toolkit, and DeepScan, a diagnosis toolkit, to provide a more comprehensive safety workflow. By unifying task and data protocols, DeepSight aims to bridge the gap between black-box risk evaluation and white-box mechanistic understanding, facilitating targeted safety alignment.
Introduces DeepSight, the first open-source toolkit to support frontier AI risk evaluation and joint safety evaluation and diagnosis by unifying task and data protocols.
The paper introduces IncompeBench, a new benchmark for Music Information Retrieval (MIR) consisting of 1,574 permissively licensed music snippets, 500 diverse queries, and over 125,000 relevance judgements. This benchmark addresses the lack of high-quality evaluation datasets in MIR, enabling more rigorous and reproducible research. High inter-annotator agreement was achieved through a multi-stage annotation pipeline, ensuring data quality.
Provides IncompeBench, a permissively licensed, fine-grained benchmark dataset to facilitate advancements in music information retrieval.
The paper introduces Cross-Modal Robustness Transfer (CMRT) to improve the robustness of End-to-End Speech Translation (E2E-ST) models against morphological variations. CMRT leverages adversarial training in the text modality to transfer robustness to the speech modality, eliminating the need for computationally expensive adversarial speech data generation. Experiments across four language pairs show that CMRT improves adversarial robustness by over 3 BLEU points compared to baseline E2E-ST models.
Introduces Cross-Modal Robustness Transfer (CMRT), a novel framework for enhancing E2E-ST model robustness by transferring adversarial robustness from text to speech.
The paper introduces PosterOmni, a framework for generalized artistic poster creation that tackles both local image editing and global design creation aspects of the task. It achieves this by constructing a multi-task dataset, distilling knowledge from local and global expert models, and applying a unified reward feedback mechanism to align visual fidelity and aesthetic preferences. Experiments on the new PosterOmni-Bench demonstrate that PosterOmni outperforms existing open-source and proprietary systems in reference adherence, composition, and aesthetics.
Introduces a novel data-distillation-reward pipeline to unify local image editing and global design creation for generalized artistic poster generation.
The paper introduces TexSpot, a diffusion-based texture enhancement framework that addresses view-inconsistency and resolution limitations in 3D texture generation. TexSpot utilizes a novel 3D texture representation called Texlet, which combines point-based and UV-based approaches by encoding local texture patches with a 2D encoder and aggregating them with a 3D encoder. Experiments show that TexSpot significantly improves visual fidelity, geometric consistency, and robustness compared to existing state-of-the-art methods.
Introduces Texlet, a novel 3D texture representation that merges the geometric expressiveness of point-based 3D textures with the compactness of UV-based representations.
The paper introduces RI-Mamba, a rotation-invariant state-space model for text-to-shape retrieval that addresses the limitations of existing methods in handling objects with arbitrary orientations and diverse categories. RI-Mamba disentangles pose from geometry using global and local reference frames and Hilbert sorting to create rotation-invariant token sequences. The model incorporates orientational embeddings via feature-wise linear modulation and employs cross-modal contrastive learning with automated triplet generation for scalable training, achieving state-of-the-art results on the OmniObject3D benchmark.
Introduces a novel rotation-invariant state-space model, RI-Mamba, for robust text-to-shape retrieval by disentangling pose from geometry and incorporating orientational embeddings.
The paper introduces OMEGA-Avatar, a novel feed-forward framework for generating animatable, 360-degree complete 3D Gaussian head avatars from a single image. To achieve this, the method incorporates a semantic-aware mesh deformation module for improved hair modeling and a multi-view feature splatting module to construct a shared canonical UV representation. Experiments demonstrate that OMEGA-Avatar outperforms existing methods in 360-degree full-head completeness and identity preservation.
Introduces a feed-forward framework, OMEGA-Avatar, that generates generalizable, 360-degree complete, and animatable 3D Gaussian head avatars from a single image by combining semantic-aware mesh deformation and multi-view feature splatting.
The paper investigates modality arbitration in Audio-LLMs, revealing a strong bias towards text over audio when the two modalities conflict, even when audio quality is superior. Using the ALME benchmark, the authors demonstrate that Gemini 2.0 Flash exhibits significantly higher text dominance in audio-text conflicts compared to text-text conflicts. They propose that this text dominance arises from an asymmetry in arbitration accessibility rather than information content, and provide evidence through interventions like forced transcription and fine-tuning ablations.
Reveals and analyzes a significant text dominance bias in audio-LLMs during modality arbitration, attributing it to differences in representational accessibility rather than information content.
The paper introduces a supervise-assisted multi-modality fusion diffusion model (MFdiff) to restore standard-dose PET (SPET) images from low-dose PET (LPET) and MR images. MFdiff uses a multi-modality feature fusion module to learn optimized fusion features from MR images and incorporates these features as additional conditions in a diffusion model for iterative SPET image generation. A two-stage supervise-assisted learning strategy leverages both generalized priors from simulated data and specific priors from in-vivo data to improve restoration quality, demonstrating superior performance compared to existing methods.
Introduces a novel supervise-assisted multi-modality fusion diffusion model (MFdiff) that effectively leverages MR images to restore high-quality SPET images from LPET data by using a two-stage training approach.
The paper introduces Spatial Chain-of-Thought (SCoT), a framework that combines the spatial reasoning of Multimodal Large Language Models (MLLMs) with the generative capabilities of diffusion models for improved image generation. SCoT trains a diffusion model on interleaved text-coordinate instructions to enhance layout awareness and uses MLLMs as planners to generate detailed layout plans. Experiments show SCoT achieves state-of-the-art performance on image generation benchmarks and excels in complex reasoning and image editing tasks.
Introduces Spatial Chain-of-Thought (SCoT), a novel plug-and-play framework that bridges MLLM reasoning and diffusion model generation by training the diffusion model with interleaved text-coordinate instructions and using MLLMs for spatial planning.
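The interleaved text-coordinate instruction is the interface between the MLLM planner and the diffusion model. The sketch below shows one plausible way such a layout plan could be serialized into a conditioning string; the LayoutItem structure, tag names, and normalized-coordinate convention are illustrative assumptions, not SCoT's actual format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LayoutItem:
    phrase: str   # object or region description
    bbox: tuple   # (x0, y0, x1, y1), normalized to [0, 1]

def serialize_layout(caption: str, items: List[LayoutItem]) -> str:
    """Interleave text phrases with their coordinates into a single
    conditioning string, a plausible stand-in for text-coordinate
    instructions used to train a layout-aware diffusion model."""
    parts = [caption]
    for it in items:
        coords = ", ".join(f"{c:.2f}" for c in it.bbox)
        parts.append(f"<obj> {it.phrase} </obj> <box> {coords} </box>")
    return " ".join(parts)

if __name__ == "__main__":
    plan = [
        LayoutItem("a red bicycle", (0.10, 0.55, 0.45, 0.95)),
        LayoutItem("a street lamp", (0.70, 0.05, 0.80, 0.90)),
    ]
    print(serialize_layout("a quiet evening street", plan))
```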
The paper introduces DreamID-Omni, a unified framework for human-centric audio-video generation, addressing tasks like reference-based generation, video editing, and audio-driven animation within a single model. It tackles the challenge of disentangling character identities and voice timbres by employing a Dual-Level Disentanglement strategy and a Symmetric Conditional Diffusion Transformer. Experimental results demonstrate state-of-the-art performance in video, audio, and audio-visual consistency, surpassing even proprietary commercial models.
Introduces a unified framework, DreamID-Omni, that achieves state-of-the-art performance on a range of human-centric audio-video generation tasks by disentangling identity and timbre control.
This paper investigates the use of local vision-language models (VLMs) to improve fine-grained activity recognition in newborn resuscitation videos, comparing them to a TimeSformer baseline. The authors explore zero-shot VLM strategies and fine-tune VLMs with LoRA on a simulated dataset of 13.26 hours of video. Fine-tuning a local VLM with LoRA achieves an F1 score of 0.91, outperforming the TimeSformer baseline (0.70), suggesting the potential of VLMs for this task.
Demonstrates that fine-tuning local vision-language models with LoRA can significantly improve activity recognition in newborn resuscitation videos compared to a TimeSformer baseline.
This paper presents an anatomical analysis of text prompting within vision-language segmentation models, specifically SAM3, revealing significant redundancy in text encoder utilization. Based on these findings, they propose SAM3-LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student. Experiments demonstrate that SAM3-LiteText reduces text encoder parameters by up to 88% while maintaining segmentation performance on image and video segmentation benchmarks.
Introduces SAM3-LiteText, a distilled MobileCLIP-based text encoder, to significantly reduce the computational and memory overhead of SAM3's text encoder without sacrificing segmentation accuracy.
This paper introduces ViTaS, a visuomotor learning framework that leverages both visual and tactile information through Soft Fusion Contrastive Learning and a CVAE module to improve performance in manipulation tasks, especially in occluded scenarios. The Soft Fusion Contrastive Learning method is designed to better exploit the alignment and complementarity of visual and tactile representations. Experiments across 12 simulated and 3 real-world environments demonstrate that ViTaS significantly outperforms existing baselines, highlighting the benefits of the proposed fusion and contrastive learning approach.
Introduces Soft Fusion Contrastive Learning to effectively fuse visual and tactile information for visuomotor tasks, improving performance in occluded scenarios by explicitly modeling the complementary nature of the two modalities.
The paper introduces DeepGen 1.0, a 5B parameter unified multimodal model for image generation and editing, designed to be lightweight and efficient compared to larger models. To enhance semantic understanding in the compact model, they propose Stacked Channel Bridging (SCB) to extract and fuse hierarchical features from VLMs with learnable 'think tokens'. They also employ a three-stage data-centric training strategy, including alignment pre-training, joint supervised fine-tuning, and reinforcement learning with MR-GRPO, achieving state-of-the-art performance on benchmarks like WISE and UniREditBench while using only 50M training samples.
Introduces Stacked Channel Bridging (SCB), a novel deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to improve the generative backbone's semantic understanding and fine-grained control.
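For a rough intuition of the channel-stacking idea, the sketch below fuses hidden states from several VLM layers along the channel axis and appends learnable 'think tokens'; dimensions, the linear projection, and the token count are assumptions for illustration rather than DeepGen 1.0 internals.

```python
import torch
import torch.nn as nn

class StackedChannelBridge(nn.Module):
    """Illustrative sketch (not the released DeepGen code): concatenate
    features from multiple VLM layers along the channel axis, project them,
    and append learnable think tokens for the generative backbone."""
    def __init__(self, n_layers: int, vlm_dim: int, out_dim: int, n_think: int = 8):
        super().__init__()
        self.proj = nn.Linear(n_layers * vlm_dim, out_dim)   # channel-wise fusion
        self.think_tokens = nn.Parameter(torch.randn(1, n_think, out_dim) * 0.02)

    def forward(self, layer_feats):
        # layer_feats: list of [B, T, vlm_dim] hidden states from selected VLM layers
        stacked = torch.cat(layer_feats, dim=-1)              # [B, T, n_layers*vlm_dim]
        fused = self.proj(stacked)                            # [B, T, out_dim]
        think = self.think_tokens.expand(fused.size(0), -1, -1)
        return torch.cat([fused, think], dim=1)               # [B, T+n_think, out_dim]

if __name__ == "__main__":
    feats = [torch.randn(2, 77, 1024) for _ in range(3)]
    bridge = StackedChannelBridge(n_layers=3, vlm_dim=1024, out_dim=768)
    print(bridge(feats).shape)   # torch.Size([2, 85, 768])
```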
The paper introduces MuRGAt, a new benchmark for evaluating fact-level multimodal attribution in complex reasoning scenarios involving video, audio, and other modalities. MuRGAt requires models to generate answers with explicit reasoning and precise citations that specify modality and temporal segments. The authors also present an automatic evaluation framework that correlates with human judgments, revealing that current MLLMs often hallucinate citations even with correct reasoning, and that increasing reasoning depth can degrade attribution accuracy.
Introduces MuRGAt, a challenging benchmark and automatic evaluation framework for fact-level multimodal attribution that exposes limitations in current MLLMs' ability to ground reasoning in heterogeneous input sources.
The paper introduces A$^{2}$V-SLP, an alignment-aware variational framework for sign language production that learns disentangled latent distributions for each articulator. This approach uses a disentangled VAE to encode sign pose sequences and extract articulator-specific mean and variance vectors, which then serve as distributional supervision for a non-autoregressive Transformer that predicts latent means and log-variances from text embeddings. By employing stochastic sampling and a gloss attention mechanism, A$^{2}$V-SLP achieves state-of-the-art back-translation performance and enhances motion realism in gloss-free sign language production.
Introduces an alignment-aware variational framework (A$^{2}$V-SLP) that learns disentangled latent distributions for sign language production, improving back-translation performance and motion realism.
This paper introduces VLAW, an iterative algorithm for co-improving vision-language-action (VLA) policies and action-conditioned video generation world models using real-world rollouts. VLAW leverages real-world data to refine the world model, which is then used to generate synthetic data for further policy improvement, addressing the limitations of world models trained solely on demonstration datasets. Experiments on a real robot demonstrate a 39.2% absolute improvement in success rate over the base policy, highlighting the effectiveness of the iterative co-improvement strategy.
Introduces an iterative co-improvement algorithm, VLAW, that refines both a vision-language-action policy and an action-conditioned video generation world model through interleaved real-world data collection and synthetic data generation.
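The iterative co-improvement loop itself is straightforward to outline. The skeleton below mirrors the described alternation between real-world data collection, world-model refinement, synthetic data generation, and policy updates; every function is a hypothetical stub standing in for a training or rollout component, not part of any released VLAW code.

```python
# Skeleton of an iterative policy/world-model co-improvement loop.
# All functions are placeholder stubs for illustration only.

def collect_real_rollouts(policy, n_episodes):
    """Run the current policy on the real robot and log rollouts."""
    return [{"policy": policy, "episode": i} for i in range(n_episodes)]

def finetune_world_model(world_model, rollouts):
    """Refine the action-conditioned video world model on real rollouts."""
    return world_model + 1  # placeholder for a real training step

def generate_synthetic_data(world_model, n_samples):
    """Imagine new trajectories with the refined world model."""
    return [{"world_model": world_model, "sample": i} for i in range(n_samples)]

def improve_policy(policy, real_data, synthetic_data):
    """Update the VLA policy on combined real and synthetic data."""
    return policy + 1  # placeholder for a real training step

def co_improve(policy=0, world_model=0, iterations=3):
    for _ in range(iterations):
        real = collect_real_rollouts(policy, n_episodes=50)
        world_model = finetune_world_model(world_model, real)
        synthetic = generate_synthetic_data(world_model, n_samples=500)
        policy = improve_policy(policy, real, synthetic)
    return policy, world_model

if __name__ == "__main__":
    print(co_improve())
```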
The paper addresses object hallucination in Multimodal Large Language Models (MLLMs) by improving visual contrastive decoding (VCD) through the creation of an object-aligned auxiliary view. This auxiliary view is constructed by masking the most salient visual evidence based on object-centric attention from self-supervised Vision Transformers, thereby disrupting unsupported tokens during decoding. The proposed method, "Mask What Matters," is prompt-agnostic, model-agnostic, and computationally efficient, leading to improved performance on object hallucination benchmarks.
Introduces a novel object-aligned visual contrastive decoding method that masks salient visual features to mitigate object hallucinations in MLLMs.
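The logit-contrast step underlying visual contrastive decoding is compact enough to sketch directly; the contrast strength alpha and the toy logits below are assumptions, with the masked-view logits standing in for a forward pass on the object-aligned auxiliary view.

```python
import numpy as np

def contrastive_decode_step(logits_orig, logits_masked, alpha=1.0):
    """One decoding step of visual contrastive decoding: amplify tokens
    supported by the original view and suppress tokens that stay likely
    even when the salient visual evidence has been masked out."""
    contrasted = (1.0 + alpha) * logits_orig - alpha * logits_masked
    probs = np.exp(contrasted - contrasted.max())
    return probs / probs.sum()

if __name__ == "__main__":
    vocab = ["dog", "frisbee", "surfboard"]
    logits_orig = np.array([4.0, 3.5, 2.5])     # full image supports "dog" and "frisbee"
    logits_masked = np.array([1.0, 1.0, 2.4])   # "surfboard" persists without visual evidence
    p = contrastive_decode_step(logits_orig, logits_masked)
    print(dict(zip(vocab, p.round(3))))          # hallucination-prone token is suppressed
```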
This paper investigates gender and skin-tone biases in Gemini Flash 2.5 Image and GPT Image 1.5 by generating 3,200 images from semantically neutral prompts. Using a pipeline involving color normalization, facial landmark masking, and skin tone quantification via Monk, PERLA, and Fitzpatrick scales, the study reveals a "default white" bias in both models. Gemini favored female-presenting subjects, while GPT favored male-presenting subjects with lighter skin tones, demonstrating that neutral prompts elicit polarized demographic defaults.
Quantifies and compares gender and skin-tone biases in Gemini Flash 2.5 Image and GPT Image 1.5 using a rigorous colorimetric methodology.
This paper explores test-time verification as a method to improve vision-language-action (VLA) alignment, addressing the "intention-action gap" in embodied instruction following. They demonstrate that scaling both rephrased instructions and generated actions at test time enhances sample diversity and improves action selection. The authors introduce CoVer, a contrastive verifier, and a hierarchical verification inference pipeline, showing that this verification approach outperforms scaling policy pre-training on the SIMPLER and PolaRiS benchmarks.
Demonstrates that scaling test-time verification, through diverse instruction rephrasing and action candidate generation, is more effective than scaling policy pre-training for vision-language-action alignment.
The paper introduces 3DGSNav, a zero-shot object navigation (ZSON) framework that leverages 3D Gaussian Splatting (3DGS) as persistent memory for vision-language models (VLMs) to improve spatial reasoning. 3DGSNav actively constructs a 3DGS representation of the environment and uses trajectory-guided free-viewpoint rendering to generate frontier-aware first-person views, which are then combined with structured visual prompts and Chain-of-Thought prompting to enhance VLM reasoning. Experiments on multiple benchmarks and a quadruped robot show that 3DGSNav achieves competitive performance compared to existing methods.
Introduces a novel zero-shot object navigation framework that integrates 3D Gaussian Splatting as persistent memory for vision-language models, enabling trajectory-guided free-viewpoint rendering and enhanced spatial reasoning.
This paper introduces a French-focused benchmark for PDF-to-Markdown conversion using VLMs, addressing the lack of evaluation datasets for non-English documents and the over-penalization of formatting variations in existing benchmarks. The benchmark consists of challenging French documents selected via model-disagreement sampling and is evaluated using unit-test-style checks targeting specific failure modes like text presence and reading order, combined with category-specific normalization. Results across 15 models show that proprietary models exhibit higher robustness on handwriting and forms, while open-weight models are competitive on standard layouts.
Introduces a new French-language PDF-to-Markdown benchmark with targeted unit tests and category-specific normalization to more accurately assess VLM performance in RAG pipelines.
The paper introduces EmoSpace, a framework for emotion-aware content generation that learns dynamic emotion prototypes via vision-language alignment to enable fine-grained emotional control in VR content creation. EmoSpace uses a hierarchical emotion representation with learnable prototypes that evolve during training, allowing for control without explicit emotion labels. Experiments demonstrate EmoSpace's superior performance in emotional image outpainting, stylized generation, and emotional panorama generation, further validated by a user study comparing emotional perception in VR versus desktop environments.
Introduces a novel emotion-aware content generation framework, EmoSpace, that learns dynamic, interpretable emotion prototypes through vision-language alignment.
This paper introduces ImagineAgent, a framework that uses cognitive reasoning and generative imagination to improve Open-Vocabulary Human-Object Interaction (OV-HOI) comprehension. ImagineAgent constructs cognitive maps to model relationships between entities and actions, and uses retrieval augmentation, image cropping, and diffusion models to gather knowledge and visual evidence. Experiments on SWIG-HOI and HICO-DET show state-of-the-art performance with significantly less training data.
Introduces ImagineAgent, a novel agentic framework that leverages cognitive maps and generative tools to enhance OV-HOI comprehension by mitigating cross-modal hallucinations and occlusion ambiguity.
The paper introduces Hi-SAM, a novel multi-modal recommendation framework designed to address limitations in semantic ID-based approaches, specifically suboptimal tokenization and architecture-data mismatch. Hi-SAM employs a Disentangled Semantic Tokenizer (DST) that uses geometry-aware alignment and coarse-to-fine quantization to separate shared and modality-specific semantics, and a Hierarchical Memory-Anchor Transformer (HMAT) that incorporates hierarchical positional encoding and anchor tokens to better model user-item interactions. Experiments on real-world datasets and a large-scale social platform demonstrate that Hi-SAM outperforms state-of-the-art baselines, particularly in cold-start scenarios, achieving a 6.55% improvement in a core online metric.
Introduces a hierarchical structure-aware multi-modal framework, Hi-SAM, that disentangles cross-modal semantics and modality-specific details during tokenization and incorporates hierarchical positional encoding within a transformer architecture for improved recommendation performance.
The paper introduces Progressive Semantic Illusions, a vector sketching task where a single sketch transforms semantically through sequential stroke additions. They propose Stroke of Surprise, a generative framework using sequence-aware joint optimization with a dual-branch Score Distillation Sampling (SDS) mechanism to satisfy distinct semantic interpretations at different drawing stages. The method dynamically adjusts prefix strokes and uses a novel Overlay Loss to enforce spatial complementarity, achieving superior recognizability and illusion strength compared to baselines.
Introduces a sequence-aware joint optimization framework with a dual-branch SDS mechanism and Overlay Loss to generate vector sketches that progressively transform between distinct semantic interpretations.
The paper introduces STVG-R1, a reinforcement learning framework for spatial-temporal video grounding (STVG) that addresses misalignment between textual descriptions and visual coordinates by reformulating per-frame coordinate prediction as instance-level identification using temporally consistent IDs embedded as visual prompts. This approach avoids the need for additional trainable modules and complex alignment strategies. By employing a task-driven reward to optimize temporal accuracy, spatial consistency, and structural format regularization, STVG-R1 achieves state-of-the-art results on multiple STVG benchmarks and demonstrates strong zero-shot generalization capabilities.
Introduces a novel visual prompting paradigm for spatial-temporal video grounding that reformulates coordinate prediction as instance-level identification and optimizes the process using reinforcement learning.
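A simplified sketch of a task-driven reward of this kind is shown below, combining temporal IoU, per-frame box IoU, and a format bonus; the component weights and exact definitions are assumptions, not the reward used in STVG-R1.

```python
def temporal_iou(pred, gt):
    """IoU between predicted and ground-truth (start, end) segments in frames."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU between two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def stvg_reward(pred_segment, gt_segment, pred_boxes, gt_boxes, well_formatted,
                w_t=0.4, w_s=0.4, w_f=0.2):
    """Toy composite reward: temporal IoU + mean per-frame box IoU + format bonus."""
    spatial = sum(box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / max(len(gt_boxes), 1)
    return w_t * temporal_iou(pred_segment, gt_segment) + w_s * spatial + w_f * float(well_formatted)

if __name__ == "__main__":
    r = stvg_reward((10, 40), (12, 38),
                    [(0.1, 0.1, 0.5, 0.5)], [(0.15, 0.1, 0.55, 0.5)],
                    well_formatted=True)
    print(round(r, 3))
```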
The paper introduces AssetFormer, an autoregressive Transformer model for generating modular 3D assets from text descriptions, addressing the need for high-quality, diverse assets in the digital industry. AssetFormer models the generation of 3D assets as a sequence of primitives with constrained design parameters, adapting module sequencing and decoding techniques from language models. Experiments using real-world modular assets demonstrate the model's effectiveness in streamlining asset creation for professional development and UGC scenarios.
Introduces an autoregressive Transformer-based architecture, AssetFormer, for generating modular 3D assets from textual descriptions by modeling the asset as a sequence of primitives.
This paper introduces a lightweight RGB-D fusion framework to improve the efficiency and accuracy of Segment Anything Models (SAM). The authors augment EfficientViT-SAM with monocular depth priors generated by a pretrained estimator, using a dedicated depth encoder to fuse the depth information with RGB features at the mid-level of the network. Training on only 11.2k samples, the proposed method outperforms EfficientViT-SAM, demonstrating the effectiveness of depth cues as geometric priors for segmentation.
Introduces a depth-aware fusion mechanism to enhance EfficientViT-SAM, enabling superior segmentation performance with significantly reduced training data.
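A toy sketch of mid-level RGB-D fusion: a small depth encoder produces a feature map that is concatenated with intermediate RGB features and projected back to the original width. Channel sizes, the fusion operator, and module names are illustrative assumptions rather than EfficientViT-SAM internals.

```python
import torch
import torch.nn as nn

class DepthEncoder(nn.Module):
    """Tiny convolutional encoder for a 1-channel monocular depth map."""
    def __init__(self, out_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, out_ch, 3, stride=2, padding=1), nn.GELU(),
        )
    def forward(self, depth):
        return self.net(depth)

class MidLevelFusion(nn.Module):
    """Concatenate RGB and depth features and project back to the RGB width."""
    def __init__(self, rgb_ch=64, depth_ch=64):
        super().__init__()
        self.proj = nn.Conv2d(rgb_ch + depth_ch, rgb_ch, kernel_size=1)
    def forward(self, rgb_feat, depth_feat):
        return self.proj(torch.cat([rgb_feat, depth_feat], dim=1))

if __name__ == "__main__":
    rgb_feat = torch.randn(1, 64, 56, 56)   # assumed intermediate RGB feature shape
    depth = torch.randn(1, 1, 224, 224)     # monocular depth prior
    fused = MidLevelFusion()(rgb_feat, DepthEncoder()(depth))
    print(fused.shape)                      # torch.Size([1, 64, 56, 56])
```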
The paper identifies limitations in current Vision-Language-Action (VLA) models stemming from inadequate visual representations learned through language-image contrastive learning or image-based self-supervised learning. It proposes JEPA-VLA, a method that integrates video predictive embeddings (specifically V-JEPA 2) into VLAs to improve environment understanding and policy priors. Experiments on benchmarks like LIBERO and real-robot tasks demonstrate that JEPA-VLA significantly improves performance by leveraging the ability of video predictive embeddings to encode task-relevant temporal dynamics.
Introduces JEPA-VLA, a novel approach that adaptively integrates video predictive embeddings into existing VLAs to enhance environment understanding and policy priors.
The paper introduces Transform Domain Fusion UNet (TD-FusionUNet), a lightweight deep learning model for next-day wildfire spread prediction using multimodal satellite data. The model incorporates trainable Hadamard Transform and Discrete Cosine Transform layers to capture frequency components in orthogonalized latent spaces, along with custom preprocessing techniques for sparse pre-fire masks. Evaluated on the Next-Day Wildfire Spread and WildfireSpreadTS datasets, TD-FusionUNet achieves an F1 score of 0.591 with only 370k parameters, surpassing a ResNet18-based UNet baseline on the WildfireSpreadTS dataset.
Introduces a novel U-Net architecture, TD-FusionUNet, that leverages trainable Hadamard and Discrete Cosine Transforms to efficiently capture frequency components in latent spaces for improved wildfire spread prediction.
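A trainable frequency-transform layer can be sketched by initializing a linear map with the orthonormal DCT-II basis and leaving the weights trainable; the axis of application and the initialization scheme below are assumptions, not TD-FusionUNet's exact layers.

```python
import numpy as np
import torch
import torch.nn as nn

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0, :] *= 1.0 / np.sqrt(2.0)
    return np.sqrt(2.0 / n) * basis

class TrainableDCT(nn.Module):
    """Linear layer initialized with the DCT-II basis and left trainable,
    applied along the last (feature) dimension."""
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.tensor(dct_matrix(dim), dtype=torch.float32))
    def forward(self, x):
        return x @ self.weight.T

if __name__ == "__main__":
    x = torch.randn(4, 16)
    y = TrainableDCT(16)(x)
    # At initialization the transform is orthonormal, so vector norms are preserved.
    print(torch.allclose(x.norm(dim=-1), y.norm(dim=-1), atol=1e-4))
```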
The paper introduces SLD-L2S, a novel lip-to-speech (L2S) framework based on a hierarchical subspace latent diffusion model that directly maps visual lip movements to the continuous latent space of a pre-trained neural audio codec, bypassing intermediate representations. The method employs a hierarchical architecture with parallel subspaces and a diffusion convolution block (DiCB) to enhance interactions within and between subspaces. By using reparameterized flow matching, the framework incorporates speech language model (SLM) and semantic losses during training, leading to state-of-the-art generation quality on benchmark datasets.
Introduces a hierarchical subspace latent diffusion model (SLD-L2S) for lip-to-speech synthesis that directly maps visual lip movements to the continuous latent space of a pre-trained neural audio codec, enabling the incorporation of SLM and semantic losses via reparameterized flow matching.
This paper introduces LLM-DRS, a novel Large Language Model (LLM)-based framework for disaster reconnaissance summarization in structural health monitoring. The framework integrates vision data and metadata from on-site investigations, using deep convolutional neural networks to extract key attributes like damage state and material type. The extracted data, along with carefully designed prompts, are then fed into an LLM to generate summary reports for individual structures or affected regions.
Introduces a novel LLM-based framework, LLM-DRS, that automates the generation of structural reconnaissance reports by integrating vision data, metadata, and deep learning-extracted attributes.
The paper introduces Code2Worlds, a framework for generating 4D dynamic scenes by formulating the task as language-to-simulation code generation. It addresses the challenges of multi-scale context entanglement and the semantic-physical execution gap by using a dual-stream architecture for disentangled object and environment generation, combined with a physics-aware closed-loop mechanism involving a PostProcess Agent and VLM-Motion Critic. Experiments on the Code4D benchmark demonstrate that Code2Worlds significantly outperforms existing methods in scene generation score (SGS) and richness, while also generating more physically plausible dynamics.
Introduces a novel framework, Code2Worlds, that leverages coding LLMs to generate physically plausible 4D dynamic scenes through a dual-stream architecture and physics-aware closed-loop refinement.
The paper introduces HoloBrain-0, a Vision-Language-Action (VLA) framework designed to improve real-world robot deployment by incorporating robot embodiment priors like multi-view camera parameters and URDF into its architecture. They employ a "pre-train then post-train" paradigm, achieving SOTA results on simulation benchmarks and strong performance on real-world manipulation tasks, even with a small 0.2B-parameter variant. The authors open-source the entire HoloBrain ecosystem, including pre-trained models, post-trained checkpoints, and a full-stack VLA infrastructure called RoboOrchard, to facilitate research and adoption.
Introduces a novel VLA architecture, HoloBrain-0, that explicitly incorporates robot embodiment priors to enhance 3D spatial reasoning and improve performance in both simulation and real-world robotic manipulation tasks.
This paper introduces ReNoV, a diffusion-based novel view synthesis framework that leverages external visual representations as conditions to improve geometric consistency. The authors analyze the correspondence capabilities within spatial attention of external representations and use this to guide the diffusion process via representation projection modules. Experiments demonstrate that ReNoV achieves significant improvements in reconstruction fidelity and inpainting quality compared to existing diffusion-based methods, particularly in scenarios with sparse, unposed images.
Introduces a novel representation projection module to effectively inject external visual representations into a diffusion model for improved novel view synthesis.
The paper introduces GigaBrain-0.5M*, a vision-language-action (VLA) model trained using world model-based reinforcement learning to improve multi-step action prediction. They leverage the spatiotemporal reasoning capabilities of video world models pre-trained on large video datasets to enhance VLA learning. By integrating world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy), GigaBrain-0.5M* achieves significant performance gains (approximately 30%) over the RECAP baseline on complex manipulation tasks and demonstrates reliable long-horizon execution in real-world deployments.
Demonstrates that integrating world model-based reinforcement learning via RAMP into a VLA model significantly improves performance and long-horizon execution on complex manipulation tasks.
This paper introduces a method for learning structured latent representations in RL where distances reflect transition costs, providing a geometric interpretation of uncertainty without explicit probabilistic modeling. They achieve this with a multimodal latent transition model and inverse distance weighting for sensor fusion, enabling adaptive integration of multiple sensor modalities. Empirical validation on multimodal RL tasks demonstrates improved robustness to sensor noise, superior state estimation, and enhanced RL agent performance compared to baselines, eliminating the need for noise augmentation.
Introduces a novel metric space formulation for state estimation in RL that learns a transition-aware latent representation, enabling a geometric interpretation of uncertainty and adaptive sensor fusion.
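Inverse distance weighting has a simple closed form: each modality's latent estimate is weighted by the inverse of its distance to a reference point (for example, the latent predicted by the transition model), so estimates that land far from the prediction contribute less. A numpy sketch under those assumptions:

```python
import numpy as np

def idw_fuse(latents, reference, power=2.0, eps=1e-8):
    """Fuse per-modality latent estimates with inverse distance weighting.

    latents:   [M, D] array, one latent estimate per sensor modality
    reference: [D] array, e.g. the latent predicted by the transition model
    Weights w_m are proportional to 1 / d(z_m, reference)^power, normalized to sum to 1.
    """
    dists = np.linalg.norm(latents - reference, axis=1) + eps
    weights = 1.0 / dists ** power
    weights /= weights.sum()
    return weights @ latents, weights

if __name__ == "__main__":
    reference = np.zeros(4)
    latents = np.stack([
        np.array([0.1, 0.0, -0.1, 0.0]),   # clean modality, close to the prediction
        np.array([1.5, -2.0, 0.8, 1.1]),   # noisy modality, far from the prediction
    ])
    fused, w = idw_fuse(latents, reference)
    print(w.round(3), fused.round(3))       # the clean modality dominates the fusion
```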
The paper introduces MTL-VQA, a multi-task learning framework for no-reference video quality assessment (NR-VQA) of gaming videos, addressing the scarcity of human-rated data by leveraging full-reference (FR) metrics as supervisory signals. By adaptively weighting and jointly optimizing multiple FR objectives during pretraining, the method learns shared perceptual representations relevant to video quality. Experiments demonstrate that MTL-VQA achieves competitive performance compared to state-of-the-art NR-VQA methods in both MOS-supervised and label-efficient settings.
Introduces a multi-task learning framework, MTL-VQA, that learns perceptual representations for gaming NR-VQA by pretraining on multiple full-reference metrics with adaptive task weighting, eliminating the need for human labels during pretraining.
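One common way to adaptively weight several training objectives is homoscedastic-uncertainty weighting, where each task carries a learnable log-variance; the sketch below uses that generic recipe and is not necessarily the weighting scheme used in MTL-VQA.

```python
import torch
import torch.nn as nn

class AdaptiveTaskWeighting(nn.Module):
    """Uncertainty-style weighting of several regression losses
    (e.g. one per full-reference metric). A generic recipe, shown
    only as an illustration of adaptive task weighting."""
    def __init__(self, n_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))  # s_i = log sigma_i^2 per task

    def forward(self, task_losses):
        # total = sum_i ( exp(-s_i) * L_i + s_i )
        losses = torch.stack(task_losses)
        return torch.sum(torch.exp(-self.log_vars) * losses + self.log_vars)

if __name__ == "__main__":
    weighting = AdaptiveTaskWeighting(n_tasks=3)            # e.g. three FR-metric targets
    fake_losses = [torch.tensor(0.8), torch.tensor(0.3), torch.tensor(1.2)]
    print(weighting(fake_losses).item())
```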
The paper addresses the problem of biased uncertainty estimation in Test-Time Adaptation (TTA) of vision-language models like CLIP, which arises from pre-training on imbalanced web data. They propose Adaptive Debiasing Tsallis Entropy (ADTE), a generalization of Shannon Entropy that incorporates a class-specific parameter to account for label bias estimated from incoming test instances. ADTE outperforms state-of-the-art TTA methods on ImageNet variants and cross-domain benchmarks by accurately selecting high-confidence views and integrating with a label adjustment strategy.
Introduces Adaptive Debiasing Tsallis Entropy (ADTE), a novel entropy measure for test-time adaptation that dynamically adjusts for label bias in vision-language models.
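Tsallis entropy generalizes Shannon entropy through a parameter q, with the Shannon form recovered as q approaches 1; ADTE's contribution is making that parameter class-specific based on estimated label bias. The sketch below shows only the base Tsallis entropy, not the adaptive debiasing itself.

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1).
    As q -> 1 it recovers the Shannon entropy -sum_i p_i log p_i."""
    p = np.asarray(p, dtype=float)
    if abs(q - 1.0) < 1e-8:
        return -np.sum(p * np.log(p + 1e-12))
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

if __name__ == "__main__":
    p = np.array([0.7, 0.2, 0.1])
    shannon = -np.sum(p * np.log(p))
    print(round(tsallis_entropy(p, 1.0 + 1e-6), 4), round(shannon, 4))  # both ~0.8018
    print(round(tsallis_entropy(p, 0.5), 4), round(tsallis_entropy(p, 2.0), 4))
```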
The paper introduces UniT, a framework for multimodal chain-of-thought test-time scaling that allows a unified model to iteratively reason, verify, and refine its outputs. UniT employs agentic data synthesis to create training data, trains a unified model, and uses flexible test-time inference to encourage cognitive behaviors. Experiments demonstrate that models trained on short reasoning trajectories generalize to longer inference chains, sequential chain-of-thought reasoning is more scalable than parallel sampling, and training on generation/editing trajectories improves out-of-distribution visual reasoning.
Introduces UniT, a novel framework enabling multimodal chain-of-thought test-time scaling for unified models, facilitating iterative reasoning, verification, and refinement.
The paper introduces Robot-DIFT, a framework that distills geometric priors from a frozen diffusion model into a deterministic Spatial-Semantic Feature Pyramid Network (S2-FPN) to improve visuomotor control. This distillation process aims to address the structural mismatch between vision encoders optimized for semantic invariance and the geometric sensitivity required for precise manipulation. Robot-DIFT, pretrained on the DROID dataset, achieves superior geometric consistency and control performance compared to discriminative baselines by leveraging the geometric dependencies encoded within diffusion model latent manifolds.
Introduces a manifold distillation approach, Robot-DIFT, to transfer geometric priors from a frozen diffusion model into a deterministic feature network, enabling geometrically consistent visuomotor control.
The paper introduces ABot-N0, a Vision-Language-Action (VLA) foundation model designed for unified embodied navigation across five core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 employs a hierarchical "Brain-Action" architecture, combining an LLM-based cognitive brain for semantic reasoning with a Flow Matching-based action expert for trajectory generation. The model is trained on a large-scale dataset of 16.9M expert trajectories and 5.0M reasoning samples, achieving state-of-the-art performance on seven benchmarks and demonstrating robust long-horizon navigation in real-world environments.
Introduces a unified Vision-Language-Action foundation model, ABot-N0, that achieves state-of-the-art performance across a diverse set of embodied navigation tasks.
The paper introduces WebTestPilot, an LLM-based agent for end-to-end web testing against natural language specifications that addresses the challenges of implicit oracle inference and probabilistic reasoning. WebTestPilot uses a symbolization layer to represent GUI elements as symbols and translates natural language into step-by-step instructions with inferred pre- and post-conditions over these symbols, effectively capturing data, temporal, and causal dependencies for validation. Experiments on a new benchmark of bug-injected web applications demonstrate that WebTestPilot achieves a 99% task completion rate with 96% precision and 96% recall in bug detection, significantly outperforming existing LLM-based approaches.
Introduces a novel approach to end-to-end web testing by inferring oracles with symbolized GUI elements, enabling the agent to validate implicit requirements and improve bug detection accuracy.
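A toy illustration of the symbolization idea: GUI elements become named symbols, and each test step carries machine-checkable pre- and post-conditions over those symbols. The data structures and example step below are assumptions for illustration, not WebTestPilot's actual representation.

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, str]   # symbol name -> current value scraped from the GUI

@dataclass
class TestStep:
    description: str
    action: Callable[[State], State]
    precondition: Callable[[State], bool]
    postcondition: Callable[[State], bool]

def run_step(step: TestStep, state: State) -> State:
    assert step.precondition(state), f"precondition failed: {step.description}"
    new_state = step.action(state)
    assert step.postcondition(new_state), f"oracle violated: {step.description}"
    return new_state

if __name__ == "__main__":
    # Symbols: CART_COUNT is the badge on the cart icon, BTN_ADD the "add to cart" button.
    state = {"CART_COUNT": "0", "BTN_ADD": "enabled"}
    add_to_cart = TestStep(
        description="clicking BTN_ADD increments CART_COUNT",
        action=lambda s: {**s, "CART_COUNT": str(int(s["CART_COUNT"]) + 1)},
        precondition=lambda s: s["BTN_ADD"] == "enabled",
        postcondition=lambda s: int(s["CART_COUNT"]) == 1,
    )
    print(run_step(add_to_cart, state))
```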
The paper introduces DiffPlace, a diffusion-based framework for generating place-controllable street view images from text, BEV maps, and object bounding boxes, specifically addressing the challenge of generating background-consistent urban scenes. DiffPlace employs a place-ID controller, using linear projection, a perceiver transformer, and contrastive learning to map place-ID embeddings into a CLIP space, enabling control over background consistency while allowing foreground variations. Experiments demonstrate that DiffPlace achieves superior generation quality and improves visual place recognition performance when used for data augmentation compared to existing methods.
Introduces a place-ID controller within a multi-view diffusion model to enable place-controllable street view generation, enhancing background consistency and foreground flexibility.
The paper introduces ScalSelect, a training-free multimodal data selection method for visual instruction tuning (VIT) that addresses the computational expense and redundancy of large-scale datasets. ScalSelect constructs sample representations by extracting visual features most attended by instruction tokens in the target VLM and then identifies samples whose representations best approximate the dominant subspace of the full dataset. Experiments demonstrate that ScalSelect achieves comparable or superior performance to full-data training using significantly less data (e.g., 16%).
Introduces ScalSelect, a scalable training-free multimodal data selection method that achieves high performance in visual instruction tuning while significantly reducing computational costs.
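One way to realize "samples whose representations best approximate the dominant subspace" is to score each sample by its projection energy onto the dataset's top singular directions; the sketch below uses that simplification, which may differ from ScalSelect's exact criterion.

```python
import numpy as np

def select_by_dominant_subspace(X, k, budget):
    """Rank samples by how strongly they load on the dataset's dominant
    subspace (top-k right singular vectors) and keep the top `budget`.
    A simplified stand-in for subspace-approximation-based selection."""
    # X: [N, D] matrix of per-sample representations
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k].T                          # [D, k] dominant subspace basis
    proj = X @ V_k                          # coordinates in the dominant subspace
    scores = np.linalg.norm(proj, axis=1)   # projection energy per sample
    return np.argsort(-scores)[:budget]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 1000 samples, 64-dim representations with strong low-rank structure
    basis = rng.normal(size=(8, 64))
    X = rng.normal(size=(1000, 8)) @ basis + 0.1 * rng.normal(size=(1000, 64))
    keep = select_by_dominant_subspace(X, k=8, budget=160)   # ~16% of the data
    print(keep.shape, keep[:5])
```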

