OmniSONAR halves cross-lingual search error on FLORES and cuts it 15-fold on BIBLE, suggesting that truly universal sentence embeddings across thousands of languages and modalities are now within reach.
Pixel-space diffusion models get a serious boost: V-Co reveals a simple recipe for visual co-denoising that outperforms existing methods on ImageNet-256 with fewer training epochs.
Ditch the text prompts: AC-Foley uses reference audio to synthesize video sound effects with unprecedented control, enabling timbre transfer and zero-shot generation.
Self-supervised video models can now learn dense features rivaling supervised methods, unlocking a 20-point jump in robot grasping success.
A surprisingly simple change to the motion latent space—representing each body joint with its own token—dramatically improves text-to-motion generation quality, outperforming monolithic latent vector approaches.
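A minimal sketch of the per-joint tokenization idea, assuming a 22-joint skeleton with 3D joint coordinates; the class name, dimensions, and projection scheme are illustrative assumptions, not the paper's API:

```python
import torch
import torch.nn as nn

class PerJointTokenizer(nn.Module):
    """Sketch: map each body joint to its own latent token instead of
    compressing the whole pose into one monolithic vector. All names
    and sizes here are assumptions, not taken from the paper."""

    def __init__(self, num_joints: int = 22, joint_dim: int = 3, token_dim: int = 256):
        super().__init__()
        # Shared linear projection applied to each joint's coordinates.
        self.proj = nn.Linear(joint_dim, token_dim)
        # Learned per-joint embeddings so tokens remain identifiable.
        self.joint_embed = nn.Parameter(torch.zeros(num_joints, token_dim))

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: (batch, frames, num_joints, joint_dim)
        tokens = self.proj(poses) + self.joint_embed  # (B, T, J, D)
        # Flatten joints into the sequence axis: one token per joint per frame,
        # giving a downstream transformer fine-grained access to each joint.
        b, t, j, d = tokens.shape
        return tokens.reshape(b, t * j, d)

tokens = PerJointTokenizer()(torch.randn(2, 16, 22, 3))
print(tokens.shape)  # torch.Size([2, 352, 256])
```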
Vision models are far more data-hungry than language models, but Mixture-of-Experts layers can balance this asymmetry, enabling truly unified multimodal models.
Multimodal models often exhibit lower confidence than their unimodal counterparts when they're about to fail, and this work leverages that insight to build a better failure detector.
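The underlying signal is easy to picture with a standard confidence heuristic. Below is a toy sketch that flags low-confidence predictions via maximum softmax probability; the threshold is an arbitrary illustration, not a value from the paper:

```python
import torch

def flag_likely_failures(logits: torch.Tensor, threshold: float = 0.6) -> torch.Tensor:
    """Sketch of confidence-based failure detection: treat a low maximum
    softmax probability as a signal the model is about to fail."""
    confidence = logits.softmax(dim=-1).max(dim=-1).values
    return confidence < threshold  # True where a prediction looks unreliable

logits = torch.randn(4, 1000)  # e.g. one batch of classifier outputs
print(flag_likely_failures(logits))
```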
Unlocking the secrets of viral video ads: a new MLLM framework reveals which initial moments hook viewers and drive conversions.
By surgically intervening in MLLM decoding, this work cuts hallucination rates without sacrificing descriptive quality, a feat prior methods struggled to achieve.
Existing safety guardrails for text-to-image models can backfire, inadvertently amplifying other types of harm, but this new method adaptively steers generation to resolve these conflicts and reduce overall harmful content.
Unlock robot learning with hidden knowledge: TOPReward extracts surprisingly accurate task progress signals directly from VLM token probabilities, bypassing the need for explicit reward engineering.
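A hedged sketch of the token-probability-as-reward idea: pose a yes/no progress question to a VLM and read the answer probabilities off its next-token distribution. The `vlm` callable, prompt wording, and token keys below are hypothetical stand-ins, not TOPReward's actual interface:

```python
import math

def progress_reward(vlm, image, task: str) -> float:
    """Sketch: convert VLM next-token log-probs for "Yes"/"No" into a
    scalar task-progress reward in [0, 1]. `vlm` is a hypothetical
    callable returning a dict of {token: log-prob}."""
    prompt = f"Question: Has the robot made progress on the task '{task}'? Answer:"
    logprobs = vlm(image=image, text=prompt)
    p_yes = math.exp(logprobs.get(" Yes", float("-inf")))
    p_no = math.exp(logprobs.get(" No", float("-inf")))
    # Normalize over the two answers so the reward lies in [0, 1].
    return p_yes / (p_yes + p_no) if (p_yes + p_no) > 0 else 0.5

# Toy stand-in model so the sketch runs end to end.
dummy_vlm = lambda image, text: {" Yes": math.log(0.7), " No": math.log(0.2)}
print(progress_reward(dummy_vlm, image=None, task="pick up the mug"))  # ~0.78
```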
Forget ImageNet: Xray-Visual sets a new SOTA for multimodal vision models by scaling to billions of social media data points with a novel three-stage training pipeline.
Forget clunky skeletons: this new model lets you prompt your way to accurate 3D human meshes from single images, even in the wildest poses.
Unlock superhuman visual reasoning in multimodal models by simply giving them the ability to think step-by-step at test time.
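For intuition, the test-time change can be as small as eliciting intermediate reasoning before the final answer. This is generic chain-of-thought prompting, not the paper's exact recipe:

```python
def cot_prompt(question: str) -> str:
    """Toy illustration of test-time step-by-step reasoning: append an
    instruction to reason before answering. Wording is an assumption."""
    return (
        f"{question}\n"
        "Look at the image and think step by step, then state your final answer."
    )

print(cot_prompt("How many red blocks are stacked on the table?"))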
Achieve state-of-the-art UAV detection by swapping transformers for Mamba, yielding a faster and more accurate multimodal detector.
Edit the bassline, drums, or other instruments of any song with this new open-source multi-stem music generation model.