MLLMs are riddled with shared vulnerabilities across modalities, meaning a single weakness can be exploited to jailbreak safety filters, hijack instructions, or even poison training data.
Achieve world-consistent video generation by directly optimizing geometry in the latent space of pre-trained video diffusion models, sidestepping costly RGB-space operations and architectural changes.
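To make the idea concrete, here is a minimal sketch in PyTorch of latent-space geometry optimization: treat the video latent as the optimization variable and descend a geometry-consistency loss through a frozen decoder. `decode_fn` and `geometry_loss` are hypothetical placeholders, not the paper's actual API.

```python
import torch

def optimize_latent(latent, decode_fn, geometry_loss, steps=50, lr=1e-2):
    """Sketch: refine a video latent so decoded frames satisfy a geometry loss.

    decode_fn: frozen pre-trained video diffusion decoder (assumed)
    geometry_loss: differentiable world-consistency penalty, e.g. multi-view
                   depth/point agreement (assumed)
    """
    latent = latent.clone().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        frames = decode_fn(latent)    # decode only to score geometry, no RGB editing
        loss = geometry_loss(frames)  # penalize world-inconsistent structure
        loss.backward()
        opt.step()
    return latent.detach()
```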
MLLMs are surprisingly prone to hallucinating subtle details, especially when asked about the absence of specific attributes or relationships within an image.
Forget paired video-music training data: V2M-Zero aligns video and music by matching the *timing* of changes within each modality, not the content itself.
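The idea lends itself to a toy illustration. The NumPy sketch below (feature extractors assumed given) correlates frame-to-frame change curves across candidate offsets, so alignment depends on *when* things change rather than what they are; it is a simplification, not V2M-Zero's actual algorithm.

```python
import numpy as np

def change_curve(features):
    """features: (T, D) per-frame or per-beat embeddings -> normalized change signal."""
    deltas = np.linalg.norm(np.diff(features, axis=0), axis=1)
    return (deltas - deltas.mean()) / (deltas.std() + 1e-8)

def best_offset(video_feats, music_feats, max_shift=50):
    """Score each temporal shift by correlation of the two change curves."""
    v, m = change_curve(video_feats), change_curve(music_feats)
    n = min(len(v), len(m))
    scores = {
        s: float(np.dot(v[max(0, -s):n - max(0, s)], m[max(0, s):n - max(0, -s)]))
        for s in range(-max_shift, max_shift + 1)
    }
    return max(scores, key=scores.get)  # offset with the strongest timing match
```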
Imagine an XR experience where you can selectively isolate and enhance individual sound sources in real-time, making chaotic audio environments crystal clear.
Forget local semantic alignment: CAST unlocks temporally coherent video retrieval and generation by explicitly modeling visual state transitions.
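As a rough intuition for transition-aware matching (a toy stand-in, not CAST's method), one can represent a clip by the *change* between its start and end embeddings instead of a pooled average, so retrieval keys on temporal dynamics:

```python
import torch
import torch.nn.functional as F

def transition_embedding(frame_feats):
    """frame_feats: (T, D) per-frame features (assumed given)."""
    start, end = frame_feats[0], frame_feats[-1]
    return F.normalize(end - start, dim=-1)  # the visual state transition

def retrieve(query_feats, clip_bank):
    """Return the index of the clip whose state transition best matches the query."""
    q = transition_embedding(query_feats)
    sims = torch.stack([q @ transition_embedding(c) for c in clip_bank])
    return int(sims.argmax())
```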
AI-generated videos can now respect physics, thanks to a framework that uses a physical simulator to guide diffusion models, resulting in more realistic and coherent motion.
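A hedged sketch of what simulator guidance during sampling can look like: after each denoising step, nudge the predicted clean video toward the simulator's physically consistent rollout. `denoise_step`, `predict_x0`, and `simulate` are placeholder names for the diffusion model, its x0-prediction, and the physics simulator, not the framework's real interface.

```python
import torch

@torch.no_grad()
def guided_sample(x_t, timesteps, denoise_step, predict_x0, simulate, weight=0.1):
    for t in timesteps:
        x_t = denoise_step(x_t, t)
        x0 = predict_x0(x_t, t)              # current estimate of the clean video
        x0_phys = simulate(x0)               # simulator's physically consistent version
        x_t = x_t + weight * (x0_phys - x0)  # steer sampling toward the simulation
    return x_t
```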
Robots can now remember what they've done and what they need to do next for 15 minutes straight, thanks to a new memory architecture that mixes video and text.
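One plausible shape for such a mixed memory, sketched under assumptions (the captioner `summarize_clip` is hypothetical): keep recent observations as raw video, and compress anything about to be evicted into a text log so the horizon extends far beyond the raw buffer.

```python
from collections import deque

class HybridMemory:
    def __init__(self, summarize_clip, max_clips=8):
        self.summarize = summarize_clip        # hypothetical video -> text captioner
        self.clips = deque(maxlen=max_clips)   # recent raw clips, kept verbatim
        self.log = []                          # older events, compressed to text

    def add(self, clip):
        if len(self.clips) == self.clips.maxlen:
            self.log.append(self.summarize(self.clips[0]))  # summarize before eviction
        self.clips.append(clip)

    def context(self):
        """Everything the policy conditions on: long text history + short video window."""
        return {"text_history": self.log, "recent_video": list(self.clips)}
```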
Multimodal web agents are surprisingly vulnerable to cross-modal attacks, but a novel adversarial training approach can double task completion efficiency while mitigating these risks.
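For flavor, here is the generic cross-modal adversarial-training recipe (an FGSM-style perturbation of the visual channel), offered as an illustration rather than the paper's specific method: push the screenshot toward higher task loss, then train the agent on that worst-case view.

```python
import torch

def adversarial_loss(model, loss_fn, image, text, target, eps=4 / 255):
    """Return the task loss on an adversarially perturbed screenshot."""
    image = image.clone().requires_grad_(True)
    grad = torch.autograd.grad(loss_fn(model(image, text), target), image)[0]
    adv = (image + eps * grad.sign()).clamp(0, 1).detach()  # worst-case visual input
    return loss_fn(model(adv, text), target)                # backprop into model params
```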
DINOv2's impressive unimodal performance doesn't translate to cross-modal understanding, but a simple training tweak can align embeddings across RGB, depth, and segmentation without sacrificing feature quality.
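The "simple training tweak" reads like a contrastive alignment objective; under that assumption, a minimal sketch keeps the pre-trained RGB encoder frozen and trains the other modality's encoder to match it with an InfoNCE loss (encoders assumed given):

```python
import torch
import torch.nn.functional as F

def alignment_loss(rgb_enc, depth_enc, rgb_batch, depth_batch, temp=0.07):
    with torch.no_grad():
        z_rgb = F.normalize(rgb_enc(rgb_batch), dim=-1)    # frozen anchor features
    z_depth = F.normalize(depth_enc(depth_batch), dim=-1)  # trainable modality encoder
    logits = z_depth @ z_rgb.T / temp                      # (B, B) cross-modal similarities
    targets = torch.arange(len(logits), device=logits.device)
    return F.cross_entropy(logits, targets)                # pull matched pairs together
```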
State-of-the-art emotion recognition in conversations is now possible by decoupling modality-specific context modeling from multimodal fusion, using a mixture-of-experts approach that doesn't require speaker identity.
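A minimal sketch of that decoupled design, under assumed feature sizes: each modality gets its own context encoder (a GRU here), and a separate mixture-of-experts layer handles fusion. Nothing below uses speaker identity; it is an illustration of the architecture shape, not the paper's exact model.

```python
import torch
import torch.nn as nn

class MoEFusion(nn.Module):
    def __init__(self, dims, d=256, n_experts=4):
        # dims: per-modality feature sizes, e.g. {"text": 768, "audio": 128, "video": 512}
        super().__init__()
        self.ctx = nn.ModuleDict({m: nn.GRU(k, d, batch_first=True) for m, k in dims.items()})
        self.experts = nn.ModuleList([nn.Linear(len(dims) * d, d) for _ in range(n_experts)])
        self.gate = nn.Linear(len(dims) * d, n_experts)

    def forward(self, feats):  # feats: {modality: (B, T, dim)} conversational context
        ctx = [self.ctx[m](x)[1][-1] for m, x in feats.items()]  # per-modality context
        h = torch.cat(ctx, dim=-1)                               # concatenated contexts
        w = torch.softmax(self.gate(h), dim=-1)                  # expert routing weights
        return sum(w[:, i:i + 1] * e(h) for i, e in enumerate(self.experts))
```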
Forget painstakingly labeling audio datasets: AuditoryHuM uses LLMs and targeted human input to automatically generate and cluster high-quality auditory scene labels.
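The pipeline shape is easy to sketch, with TF-IDF and k-means as stand-ins for whatever embedding and clustering the system actually uses: an LLM proposes free-text scene labels (that call is outside this snippet), the labels are clustered, and one representative per cluster goes to a human for targeted review.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_labels(labels, n_clusters=10):
    """labels: LLM-proposed free-text scene labels -> one representative per cluster."""
    vecs = TfidfVectorizer().fit_transform(labels)    # simple text embedding (stand-in)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(vecs)
    reps = {}
    for i, c in enumerate(km.labels_):
        reps.setdefault(int(c), labels[i])            # first label seen in each cluster
    return reps  # {cluster_id: representative label} -> send to human verification
```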
Existing deforestation monitoring maps misclassify smallholder agroforestry as "forest," risking unfair penalties under regulations like the EUDR.
Despite recent advances, multimodal models still struggle to understand spatial relationships from an egocentric perspective, as shown by a 37.66% performance gap on the new SAW-Bench benchmark.
LLMs can now generate physics explanation videos up to 6 minutes long, but their visual reasoning and the reliability of auto-generated Manim code still need significant improvement.
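For context on why reliability is hard, here is a minimal scene using the real Manim Community API, the kind of snippet such systems must emit correctly; a multi-minute physics explainer chains hundreds of these steps, which is where generated code tends to break. The scene itself is an illustrative example, not output from the paper's system.

```python
from manim import Scene, Circle, Create, FadeOut, DOWN

class FallingBall(Scene):
    def construct(self):
        ball = Circle(radius=0.3)
        self.play(Create(ball))
        self.play(ball.animate.shift(3 * DOWN))  # simple "gravity" motion
        self.play(FadeOut(ball))
```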