Beijing Institute of Technology, Beijing, China
Achieve significantly better structure preservation in text-guided image editing by injecting structure-related features into visual autoregressive models, guided by reinforcement learning.
Multimodal LLMs primarily rely on language-unique information for final predictions, with visual information decaying across layers and cross-modal synergy remaining surprisingly low (under 2%).