CMU ML · May 21, 2026 · arXiv:2605.22344

Bernini: Latent Semantic Planning for Video Diffusion

Bernini Team, Chenchen Liu, Junyi Chen, Lei Li, Lu Chi, Mingzhen Sun, Zhuoying Li, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan

AI Summary

Bernini unifies MLLMs and diffusion models for video generation and editing: the MLLM performs semantic planning in the ViT embedding space, and a diffusion model renders pixels conditioned on that plan. Because the two components communicate only through semantics, the planner and renderer can be trained separately and then lightly co-trained, which preserves their pretrained strengths and keeps training efficient. The framework also introduces Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE) for handling multiple visual inputs and uses chain-of-thought reasoning in the planner. The authors report state-of-the-art performance across a range of video generation and editing benchmarks.

Key Contribution

Bernini argues that state-of-the-art video generation and editing can be reached with a simple division of labor: MLLMs for semantic planning, diffusion models for photorealistic rendering.

Abstract

Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and, for editing, source VAE features for detail preservation. Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained, preserving the pretrained strengths of both components while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE), and further incorporate chain-of-thought reasoning in the planner to better transfer understanding into generation. Bernini achieves state-of-the-art performance across a wide range of video generation and editing benchmarks, with the MLLM's pretrained understanding translating into strong generalization on challenging editing tasks.
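The planner/renderer interface described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`RenderConditions`, `toy_planner`, `build_conditions`, `toy_renderer`) is hypothetical, the planner is replaced by a simple average, and the renderer only reports which conditioning streams it received. The one point it tries to show is that the semantic plan and text features are always passed to the renderer, while source VAE features are added only for editing.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]


@dataclass
class RenderConditions:
    """Everything the renderer sees; all field names are hypothetical."""
    semantic_plan: List[Vector]                         # planner output (ViT embedding space)
    text_features: List[Vector]                         # text conditioning
    source_vae_features: Optional[List[Vector]] = None  # editing only


def toy_planner(source_vit: List[Vector], num_target_tokens: int, dim: int) -> List[Vector]:
    """Stand-in for the MLLM planner: emits target embeddings of width `dim`.

    A real planner would reason over text and visual inputs; this toy version
    averages the source embeddings, or returns zeros when there is no source.
    """
    if source_vit:
        mean = [sum(v[i] for v in source_vit) / len(source_vit) for i in range(dim)]
    else:
        mean = [0.0] * dim
    return [list(mean) for _ in range(num_target_tokens)]


def build_conditions(plan: List[Vector],
                     text_features: List[Vector],
                     source_vae: Optional[List[Vector]] = None) -> RenderConditions:
    """Assemble renderer inputs; VAE features are attached only when editing."""
    return RenderConditions(plan, text_features, source_vae)


def toy_renderer(cond: RenderConditions) -> dict:
    """Stand-in for the DiT renderer: reports which streams it was conditioned on."""
    streams = ["semantic_plan", "text_features"]
    if cond.source_vae_features is not None:
        streams.append("source_vae_features")
    return {"streams": streams, "plan_tokens": len(cond.semantic_plan)}


# Generation: no source video, so no VAE features are passed.
text = [[0.0] * 8]
gen = toy_renderer(build_conditions(toy_planner([], 4, 8), text))

# Editing: source ViT embeddings feed the planner, source VAE features feed the renderer.
edit_plan = toy_planner([[1.0] * 8, [3.0] * 8], 4, 8)
edit = toy_renderer(build_conditions(edit_plan, text, source_vae=[[0.5] * 16]))
```

Because the renderer depends only on the `RenderConditions` container, either stand-in could be swapped out without touching the other, which mirrors the abstract's claim that a semantic interface permits separate training.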

Computer Vision · Multimodal Models · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
