Search papers, labs, and topics across Lattice.
HKUST
3
0
6
5
Instead of training separate video diffusion models for each multimodal task, UniVidX learns a single model that handles diverse pixel-aligned video generation problems.
Unified multimodal models aren't truly unified: vision and language modalities exhibit divergent entropy patterns during encoding and generation, hindering effective reasoning-based image synthesis.
Forget wrestling with finicky text prompts or tedious manual camera paths: ShotVerse lets you generate cinematic multi-shot videos from text, thanks to its clever "Plan-then-Control" framework and a dataset of aligned camera trajectories.