Forget fine-tuning: DynaEdit unlocks complex video edits like action modification and object insertion, all without training, by cleverly manipulating pretrained text-to-video models.
Reconstructing humans and their environments from multi-view video can now be done in a single pass, 8x faster, with no extra modules or preprocessing.
Forget fine-tuning: surprisingly, single-neuron activations in VLMs can be probed directly to build classifiers that outperform the full model, at a 5x speedup.
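The probing recipe is simple enough to sketch: record one hidden unit's scalar activation per image (e.g., via a forward hook on a VLM layer), then fit a one-feature classifier on those scalars. The data below is a stand-in, not the paper's setup:

```python
# Sketch of a single-neuron probe, assuming activations have already been
# extracted (e.g., via a PyTorch forward hook on one VLM layer). The data
# here is an illustrative stand-in.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=1000)  # one scalar activation per image
labels = (acts + 0.5 * rng.normal(size=1000) > 0).astype(int)  # toy labels

probe = LogisticRegression()
probe.fit(acts.reshape(-1, 1), labels)  # a single feature: one neuron
print("probe accuracy:", probe.score(acts.reshape(-1, 1), labels))
```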
Multimodal web agents are surprisingly vulnerable to cross-modal attacks, but a novel adversarial training approach can double task-completion efficiency while mitigating these risks.
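For context, a single adversarial training step on the visual input typically looks like the sketch below. This is the generic FGSM-style pattern, not necessarily the paper's specific approach:

```python
# Generic adversarial-training step (FGSM-style) on an image input; purely
# illustrative of the defense pattern hinted at above.
import torch
import torch.nn.functional as F

def adversarial_step(model, optimizer, images, labels, eps=8 / 255):
    # Craft a one-step adversarial perturbation of the visual input.
    images = images.clone().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grad, = torch.autograd.grad(loss, images)
    adv = (images + eps * grad.sign()).clamp(0, 1).detach()

    # Train on the perturbed batch so the model stays robust to it.
    optimizer.zero_grad()
    F.cross_entropy(model(adv), labels).backward()
    optimizer.step()
```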
DINOv2's impressive unimodal performance doesn't translate to cross-modal understanding, but a simple training tweak can align embeddings across RGB, depth, and segmentation without sacrificing feature quality.
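One common way to pull per-modality embeddings together is a symmetric contrastive term over matched pairs, sketched below. This is a generic InfoNCE-style recipe, not necessarily the specific tweak the paper proposes:

```python
# Illustrative cross-modal alignment loss: pull embeddings of the same scene
# together across modalities (here RGB vs. depth) with symmetric InfoNCE.
import torch
import torch.nn.functional as F

def align_loss(z_rgb, z_depth, temperature=0.07):
    z_rgb = F.normalize(z_rgb, dim=-1)
    z_depth = F.normalize(z_depth, dim=-1)
    logits = z_rgb @ z_depth.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(z_rgb.size(0), device=z_rgb.device)
    # Matched pairs sit on the diagonal; treat them as the positives.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```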
Achieve state-of-the-art ECG analysis by disentangling modality-specific biases and capturing spatiotemporal dependencies, outperforming existing multimodal approaches.
Existing deforestation monitoring maps misclassify smallholder agroforestry as "forest," risking unfair penalties under regulations like the EUDR.
Forget textual descriptions: this zero-shot image retrieval method hallucinates the target image directly, outperforming the state of the art by creating a whole synthetic world to match against.
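To make the "hallucinate, then match" idea concrete, here is a generic generate-then-retrieve sketch. The generator and encoder are assumed callables (e.g., a text-to-image model and a CLIP-style image encoder), not the paper's actual pipeline:

```python
# Generate-then-retrieve sketch: synthesize a guess at the target image,
# embed it, and rank the gallery by similarity. Interfaces are illustrative.
import numpy as np

def retrieve(query_text, gallery_embs, generator, encoder, k=5):
    fake_target = generator(query_text)  # hallucinate the target image
    q = encoder(fake_target)             # embed the synthetic image
    q = q / np.linalg.norm(q)
    sims = gallery_embs @ q              # gallery rows assumed unit-norm
    return np.argsort(-sims)[:k]         # indices of the top-k matches
```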
Ditch the high-fidelity simulator: IRL-VLA uses a lightweight reward world model trained with inverse reinforcement learning to enable efficient and effective closed-loop RL training for autonomous driving.
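A minimal sketch of what closed-loop training against a learned reward model can look like; the policy and reward-model interfaces below are assumptions for illustration, not IRL-VLA's actual code:

```python
# Closed-loop policy update against a learned reward world model, in place of
# a full simulator. REINFORCE-style update for simplicity; interfaces assumed.
import torch

def policy_gradient_step(policy, reward_model, optimizer, states):
    actions, log_probs = policy.sample(states)   # assumed policy API
    with torch.no_grad():
        rewards = reward_model(states, actions)  # reward model learned via IRL
    # Reinforce actions the reward model scores highly.
    loss = -(log_probs * rewards).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```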
LVLMs struggle to navigate cultural nuances, with even the best models achieving only 62% awareness and 38% compliance on a new benchmark spanning 16 countries.