Mid-training VLMs on a carefully curated subset of VLM data that is highly aligned with the VLA domain significantly boosts their performance on embodied tasks, letting them rival much larger models.
Mismatched visual elements torpedo design harmony, but GIST offers a training-free fix that stylistically blends components, boosting aesthetic quality in existing pipelines.
Dramatically improve multimodal recommendation accuracy without any training by initializing user embeddings with item modality features and user cluster information.
Iterative visual refinement lets agents navigate dense coding IDEs with superhuman precision, outperforming single-shot methods and paving the way for more reliable software engineering agents.
Unlock 20x faster and more accurate 3D human-object contact estimation in complex, multi-person scenes with Pi-HOC, a framework that doesn't require object meshes.
Stop reimplementing multimodal models: TorchUMM offers a unified codebase for evaluation, analysis, and post-training, streamlining research across diverse architectures and tasks.
Achieve sub-centimeter robotic placement accuracy from compositional language instructions by decomposing the task into visual goal representation and goal-conditioned execution.
Imagine populating any 3D environment with digital humans that spontaneously navigate and interact, driven only by visual input and goals.
Video diffusion models already contain implicit multi-view knowledge, making them surprisingly effective for novel view synthesis when adapted to ignore temporal coherence.
Forget simulated manipulation: ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
Finally, digital humans can have realistic, socially aware conversations: DyaDiT generates dyadic gestures that users strongly prefer over existing methods.
Forget monolithic models: pMoE shows that ensembling diverse expert prompts within a single model framework yields surprisingly large gains in visual adaptation across a wide range of tasks.
By decomposing long-horizon manipulation into transport and object-centric interaction, LiLo-VLA achieves state-of-the-art zero-shot generalization and robustness, outperforming end-to-end VLA models by a large margin.
Forget cloud GPUs – a new model brings unified multimodal understanding and generation to your iPhone, running 6x faster than alternatives.
MLLMs struggle with multi-turn chart editing, forgetting context and accumulating errors, especially when the edits involve data transformations, not just styling.
Forget slow text-based communication: Vision Wormhole unlocks faster multi-agent reasoning by turning VLMs into telepathic hubs, slashing runtime without sacrificing fidelity.
Stop treating generated images like real ones: GMAIL aligns them as separate modalities in a shared latent space, unlocking significant gains in vision-language tasks.
VLMs that ace RGB images completely fail at thermal imagery, revealing a critical gap in their ability to reason about temperature and physical properties.
Forget static datasets: RL-based co-training unlocks a +20% gain in real-world VLA performance by interactively leveraging simulation while preserving real-world capabilities.
Synthetic data generated by fine-tuning Stable Diffusion on multi-region satellite imagery boosts small object detection accuracy by 20%, even when real labeled data is scarce.
Forget tedious manual annotation: FlexDataset crafts customized, high-fidelity annotated datasets 5x faster using a composition-to-data approach.
Achieve semantically coherent image compositions by mixing layout-focused and appearance-focused visual representations in a diffusion model's cross-attention.