Adversarial clothing with non-overlapping visible-thermal patterns can reliably evade RGB-T detectors, even transferring across different fusion architectures.
Instead of training separate video diffusion models for each multimodal task, UniVidX learns a single model that handles diverse pixel-aligned video generation problems.
Even the most advanced vision-language models struggle to accurately identify anatomical structures in medical images, raising serious concerns about their reliability in clinical settings.
Forget turn-based interactions: MiniCPM-o 4.5 lets you build AI that sees, hears, speaks, and *reacts* in real time, all on a device with only 12GB of RAM.
Today's best vision-language models are surprisingly bad at reading scientific figures, failing to match expert-level reasoning on a new benchmark of experimental images.
Stop letting SFT ruin your LMMs: PRISM uses on-policy distillation to realign your model *before* RL, boosting performance by up to 6%.
By pretraining a VLA model with goal-conditioned RL, PRTS learns to reason about goal reachability, leading to substantial gains in long-horizon robotic tasks and zero-shot generalization.
Multimodal perception is no longer just an add-on: GLM-5V-Turbo bakes it directly into the core of reasoning, planning, and action.
Achieve real-time robotic action with 79-91% success while generating high-fidelity 4D reconstructions, all within a single unified world model.
Ditch the pixel-perfect edits: letting multimodal models fully *reimagine* images based on semantic understanding yields massive quality gains in refinement tasks.
Imagine specifying complex 3D articulations with just a few 2D sketches – Sketch2Arti makes it a reality.
MLLMs are better at understanding videos than directly grounding text queries within them, and a self-correction training loop can close the gap.
MLLMs often *hallucinate* the referent of a pointing gesture, latching onto nearby or salient objects instead of truly understanding spatial semantics.
Achieve millimeter-level accuracy in 3D human body fitting from multi-modal inputs, even with scale distortion common in AI-generated assets.
Point-VLMs can learn to see the world as it really is: targeted reward assignment and cross-modal verification nearly close the reality gap in 3D reasoning.
Generative training not only enhances a model's ability to manipulate objects in images, but also surprisingly strengthens its spatial reasoning skills.
LLMs can now predict where drivers look with uncanny human-like accuracy, thanks to a new dataset and architecture that grounds attention in objects, not just scenes.
MLLMs still struggle to integrate diverse data for clinical reasoning, as evidenced by their poor performance on a new ophthalmology benchmark spanning image quality assessment to diagnosis.
Pocket-sized VLA models can now achieve state-of-the-art robot manipulation performance by pre-training on a curated multimodal dataset and injecting manipulation-relevant representations into the action space.
Stop fragmented land cover predictions: SSDM leverages global geospatial embeddings to guide local feature extraction, achieving state-of-the-art performance in high-resolution remote sensing mapping.
Freezing a Stable Diffusion backbone and injecting CLIP and BLIP features lets you beat the state-of-the-art in zero-shot sketch-based 3D shape retrieval, without any costly retraining.
MV-HGNN achieves superior 3D shape retrieval by effectively leveraging geometric dependencies and semantic alignment, outperforming existing methods in zero-shot settings.
Seemingly impressive VLA performance on robotic benchmarks crumbles when stress-tested with causal interventions, exposing a reliance on brittle shortcuts rather than genuine embodied reasoning.
VLAs can learn to adapt to new environments at test time without any fine-tuning, achieving significant performance gains on robotic manipulation and Atari games.
Targeted neuron fine-tuning can unlock superior image translation capabilities in multimodal large language models, outperforming traditional methods by preserving pre-trained knowledge.
Autoregressive 3D layout generation can be both more physically plausible and significantly faster by repurposing existing 3D generative models.
Forget relying on fickle visuals: this new ReID method uses language to describe *who* a person is, not just what they look like, and it beats prior methods on existing benchmarks.
MLLMs still struggle to reason about everyday situations when they require identifying and using visual clues, despite excelling at tasks relying on pre-existing knowledge.
MLLMs don't just forget language; they also suffer from perceptual drift in cross-modal spaces, but MAny offers a training-free merging strategy to fix both.
Achieve photorealistic, identity-consistent facial video edits from text prompts without video training data, rivaling traditional rendering software.
Imagine creating high-fidelity, navigable 3D worlds from just a text prompt or a single image – HY-World 2.0 makes it a reality.
Extracting agricultural parcels from satellite imagery gets a whole lot harder (and more realistic) with a new dataset focused on the complex, irregular, and heterogeneous terrain of terraced farms.
By explicitly modeling both consensus and discrepancy between RGB and IR data, this text-guided multispectral object detector significantly boosts performance on multispectral benchmarks.
Finally, a model that speaks fluent Lottie: LottieGPT generates editable vector animations directly from text or images, opening up a new frontier for resolution-independent, compact, and semantically structured multimedia creation.
Achieve state-of-the-art object detection accuracy and efficiency by fusing RGB frames and event streams with a sparse hypergraph and a fine-grained mixture of experts, enabling real-time edge deployment.
Achieve real-time (40 FPS at 720p) interactive video generation with minute-long memory consistency using a 5B-parameter world model.
Robots can now better assemble boxes in the real world thanks to a video-generative value model that anticipates future states, moving beyond static snapshots for more reliable task progress assessment.
Medical MLLMs, despite their size and training data, stumble on basic image classification due to four key failure modes, revealing a disconnect between hype and clinical readiness.
Turns out, you can cut critical errors in VLM-generated image editing instructions in half with a clever two-stage training pipeline, leading to SOTA editing performance.
LLMs can now leverage visual structure, not just text, to pinpoint bugs in multimodal programs, thanks to a novel graph alignment approach that bridges the gap between GUI screenshots and code.
World models are more valuable for synthesizing structured supervision for navigation learning than for directly providing action-ready imagined evidence.
Forget fixed pipelines: training an agent to *learn* when and how to search for knowledge dramatically improves performance on knowledge-based visual question answering.
Current multimodal LLMs struggle with guideline-constrained clinical reasoning, but a simple multi-agent framework can significantly boost their performance on real-world lung cancer diagnosis and treatment.
Forget global context – ReAlign leverages a stronger VLM to generate *local*, reasoning-guided descriptions that boost visual document retrieval by up to 2%.
Existing multimodal sentiment analysis models crumble under real-world noise, but QA-MoE leverages uncertainty to dynamically route inputs, achieving robust performance across a continuous spectrum of data quality.
VLA models, seemingly robust, crumble when faced with diverse linguistic variations, as a new red-teaming approach reveals a staggering drop in task success from 93% to just 6%.
Achieve state-of-the-art 3D object detection in adverse weather by adaptively routing between LiDAR, radar, and fused features based on learned weather conditions.
Current multimodal models can't handle the rapid-fire tactical analysis required for boxing commentary, as revealed by a new dataset and evaluation framework.
Current multimodal dialogue systems can't capture the subtle expressiveness of human interaction, as revealed by a new benchmark dataset of movie and TV dialogues.
Instruction-guided video editing can achieve impressive zero-shot performance simply by pre-training on motion-centric video restoration tasks *before* fine-tuning on paired editing data.