Search papers, labs, and topics across Lattice.
Tencent Hunyuan
6
0
6
4
Frame-level causal attention is all you need for effective visual reconstruction in unified multimodal models.
Current image difference captioning benchmarks fail to capture semantic consistency and penalize hallucinations, but DiffCap-Bench offers a robust alternative that aligns with human expert judgments and predicts downstream utility for image editing.
LMMs can learn to generate images *and* improve their understanding abilities, without catastrophic forgetting, by carefully disentangling and sharing experts within a MoE architecture.
By tightly coupling reasoning, searching, and generation, Unify-Agent demonstrates that agent-based modeling can substantially improve world knowledge grounding in image synthesis, rivaling closed-source models.
Achieve SOTA in both visual generation and understanding by harmonizing generative and semantic representations within a single ViT architecture.
Ditch discrete visual tokens: UniCom achieves SOTA multimodal generation by compressing continuous semantic representations, unlocking better controllability and consistency in image editing.