Search papers, labs, and topics across Lattice.
3
0
6
Decomposing rasterized graphic designs into editable layers is now significantly more faithful and controllable thanks to a hybrid generative approach that combines vision-language models and multi-branch diffusion.
Current multimodal agents still struggle to combine ambiguous visual cues with open-web verification, highlighting a critical gap in their ability to perform complex geolocation tasks.
By adaptively calibrating facts and augmenting emotions, FACE-net overcomes the factual-emotional bias that plagues emotional video captioning.