Current OmniLLMs stumble when processing real-world, long-form audio-visual content, achieving only ~35-65% accuracy on a new benchmark designed to test long-term memory and fine-grained understanding.
Robots can now manipulate objects with greater dexterity and adaptability thanks to a new world model that leverages both vision and high-frequency tactile feedback to predict and react to contact dynamics.
By explicitly reasoning in 3D, VolumeDP leaps ahead of 2D-based imitation learning methods, achieving a remarkable 14.8% improvement on the LIBERO benchmark and robust real-world generalization.
Current image generation unlearning methods are surprisingly brittle: adversarial image prompts, optimized with attention-guided masking, can effectively resurrect supposedly "forgotten" concepts.
Current multimodal models are surprisingly bad at understanding long, complex videos, struggling to integrate audio, visual, and text cues even for basic reasoning tasks.
MLLMs can now handle 4K videos up to 100x faster thanks to AutoGaze, which selectively attends to only the most informative patches.
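The blurb does not say how AutoGaze scores patches, but the general recipe (rank patches, keep the top few before the LLM sees them) can be sketched. Below is a minimal, hypothetical illustration; the feature-norm saliency score and the function name are assumptions, not the paper's method.

```python
# Minimal sketch of top-k patch selection before feeding an MLLM.
# The scoring rule (feature-norm saliency) is an illustrative assumption;
# AutoGaze's actual selection criterion is not specified in the blurb.
import torch

def select_informative_patches(patch_feats: torch.Tensor, keep_ratio: float = 0.1):
    """patch_feats: (num_patches, dim) visual features for one frame."""
    scores = patch_feats.norm(dim=-1)                   # saliency proxy
    k = max(1, int(keep_ratio * patch_feats.shape[0]))
    top = torch.topk(scores, k).indices.sort().values   # keep original patch order
    return patch_feats[top], top

feats = torch.randn(4096, 1024)                # e.g. patches from one 4K frame
kept, idx = select_informative_patches(feats, keep_ratio=0.01)
print(kept.shape)                              # roughly 100x fewer tokens for the LLM
```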
Achieve real-time, synchronized audio-visual generation at 25 FPS by distilling a bidirectional diffusion model into a fast, autoregressive architecture, overcoming training instability with novel alignment and token handling techniques.
Current multimodal LLMs choke on long-form video understanding, either forgetting details or getting lost in the timeline, but a new agentic architecture with dynamic memory management offers a promising fix.
Text-to-video generation gets a 1.58x speed boost with CalibAtt, a training-free method that exploits consistent sparsity patterns in attention layers.
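The idea of "consistent sparsity patterns" can be illustrated with a small sketch: measure once, on calibration inputs, which attention entries carry weight, then reuse that mask at inference. The quantile thresholding below is an assumption for illustration, not CalibAtt's published procedure.

```python
# Hedged sketch: calibrate a sparsity mask once, reuse it to skip
# low-weight attention entries at inference (training-free).
import torch

def calibrate_mask(q, k, keep: float = 0.2):
    """Measure which attention entries carry weight on calibration inputs."""
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    mask = attn >= torch.quantile(attn, 1.0 - keep)
    mask |= torch.eye(attn.shape[-1], dtype=torch.bool)   # never mask a full row
    return mask

def sparse_attention(q, k, v, mask):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(64, 128)     # toy single-head example
mask = calibrate_mask(q, k)
out = sparse_attention(q, k, v, mask)
```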
By combining video generation and vision-language models, EmboAlign achieves a 43% boost in real-world robot manipulation success without any task-specific training.
Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
Forget everything you thought you knew about continual learning: pretrained Vision-Language-Action models can learn new robotic skills without catastrophic forgetting, even with minimal replay.
Multimodal models often exhibit lower confidence than their unimodal counterparts when they're about to fail, and this work leverages that insight to build a better failure detector.
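The underlying signal is easy to picture: when the model's confidence on its own prediction drops, flag a likely failure. The sketch below is the classic max-softmax-confidence baseline with a tuned threshold, not necessarily the paper's exact detector.

```python
# Minimal sketch: flag likely failures when the model's confidence
# (max softmax probability) falls below a threshold tuned on validation data.
import torch

def predict_failure(logits: torch.Tensor, threshold: float = 0.6) -> torch.Tensor:
    """logits: (batch, num_classes). Returns True where a failure is predicted."""
    confidence = torch.softmax(logits, dim=-1).max(dim=-1).values
    return confidence < threshold

logits = torch.randn(8, 10)
print(predict_failure(logits))
```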
By explicitly guiding attention with predicted action sequences, AGA overcomes the limitations of standard dot-product attention in video action anticipation, leading to better generalization and interpretability.
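One simple way to "guide attention with predicted actions" is to add a bias to the attention logits based on how well each key matches an embedding of the predicted action sequence. The additive-bias form below is only an assumed illustration of the idea, not AGA's published mechanism.

```python
# Hedged sketch: bias dot-product attention toward video tokens that agree
# with an embedding of the predicted action sequence (additive-bias assumption).
import torch

def action_guided_attention(q, k, v, action_emb, guide_weight: float = 1.0):
    """q, k, v: (seq, dim); action_emb: (dim,) embedding of predicted actions."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    guidance = (k @ action_emb) / d ** 0.5          # how well each key matches the plan
    scores = scores + guide_weight * guidance       # broadcast over query positions
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(16, 64)
out = action_guided_attention(q, k, v, action_emb=torch.randn(64))
```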
By explicitly disentangling shared and view-specific features across multi-view fundus images, MVGFDR achieves superior diabetic retinopathy grading compared to methods that directly fuse visual features.
Unlock robot learning with hidden knowledge: TOPReward extracts surprisingly accurate task progress signals directly from VLM token probabilities, bypassing the need for explicit reward engineering.
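Reading a progress signal out of token probabilities can be sketched with a yes/no query: ask the VLM whether the task is done and treat the probability mass on "Yes" as a dense reward. The prompt, token ids, and function name below are placeholders for illustration, not TOPReward's exact formulation.

```python
# Hedged sketch: turn VLM token probabilities into a task-progress reward.
import torch

def progress_reward(next_token_logits: torch.Tensor, yes_id: int, no_id: int) -> float:
    """next_token_logits: (vocab,) logits for the first answer token."""
    pair = torch.softmax(next_token_logits[[yes_id, no_id]], dim=-1)
    return pair[0].item()            # probability mass on "Yes" vs "No"

# Usage: run the VLM on (frame, "Has the robot finished the task? Yes or No:")
# and pass the next-token logits plus the tokenizer's ids for "Yes"/"No".
vocab = 32000
logits = torch.randn(vocab)
print(progress_reward(logits, yes_id=3869, no_id=1939))   # placeholder ids
```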
Time series generation can be dramatically improved by explicitly conditioning on semantic understanding, as demonstrated by a novel vision-centric framework.
Forget painstakingly engineering robot behaviors: DreamZero learns directly from video of other robots or even humans, adapting to new tasks and bodies with just minutes of data.
Forget robotics pre-training: ActionCodec, a new action tokenizer designed with information-theoretic principles, achieves state-of-the-art VLA performance on LIBERO.
Forget monolithic LoRAs: LoRWeB dynamically mixes a basis set of LoRAs to unlock SOTA generalization in visual analogy tasks.
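Dynamically mixing a basis of LoRAs can be sketched as a small router producing input-dependent weights over several low-rank deltas. The router design and gating below are assumptions for illustration; LoRWeB's exact scheme may differ.

```python
# Hedged sketch: input-dependent mixture over a basis of LoRA deltas.
import torch
import torch.nn as nn

class LoRAMixture(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 8, num_loras: int = 4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(num_loras, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_loras, rank, d_out))
        self.router = nn.Linear(d_in, num_loras)    # produces mixing weights

    def forward(self, x: torch.Tensor, base_out: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.router(x), dim=-1)              # (batch, num_loras)
        delta = torch.einsum("bd,ndr,nro->bno", x, self.A, self.B)
        return base_out + torch.einsum("bn,bno->bo", w, delta)

layer = LoRAMixture(d_in=512, d_out=512)
x = torch.randn(2, 512)
out = layer(x, base_out=torch.zeros(2, 512))
```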
Forget static datasets – RL-based co-training unlocks +20% real-world VLA performance by interactively leveraging simulation while preserving real-world capabilities.
By decoupling MLLM instruction tuning from DiT alignment, DuoGen achieves state-of-the-art interleaved multimodal generation without costly unimodal pretraining.
VLMs, typically praised for their multimodal synergy, can be easily weaponized to manipulate search rankings via imperceptible image perturbations and fluent textual suffixes, outperforming unimodal attacks.
Synthesizing realistic radar data from camera images is now possible, bridging the gap between visual and radar perception for autonomous driving.
Forget synthetic data that looks like it came from a PS2 game: NVIDIA's new Cosmos-Predict2.5 generates high-fidelity videos for training embodied AI, opening the door to more realistic and reliable simulations.
Multimodal ophthalmic AI is poised for a leap, but current models still struggle with data variability, limited annotations, and generalization across diverse patient populations.
A 3B parameter model, Audio Flamingo 2, now rivals larger proprietary models in audio understanding and reasoning, even handling audio segments up to 5 minutes long.