100 papers published across 9 labs.
LMMs can slash FLOPs by 89% without sacrificing accuracy, thanks to a frequency-modulated visual restoration technique that preserves crucial visual semantics even with fewer tokens.
Tactile robotic perception gets a boost with a new pretraining method that explicitly encodes force, geometry, and orientation, leading to a 52% reduction in regression error.
Achieve up to 1.28x faster VLA model inference for robotic manipulation without retraining, simply by merging visual tokens based on depth.
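The blurb doesn't spell out the algorithm, but depth-based token merging can be sketched generically: bucket patch tokens by their estimated depth and average-merge each bucket before the frozen backbone sees them. Everything below (function name, bin count, shapes) is a hypothetical illustration under that assumption, not the paper's implementation.

```python
import torch

def merge_tokens_by_depth(tokens, depth, num_bins=8):
    """Hypothetical depth-bin token merging (illustrative sketch only).

    tokens: (N, D) visual token embeddings
    depth:  (N,)  per-token depth estimates (e.g., from a monocular depth map)
    Returns a shorter sequence where tokens in the same depth bin are averaged.
    """
    # Normalize depth to [0, 1) and assign each token to a bin.
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-6)
    bins = (d * num_bins).long().clamp(max=num_bins - 1)

    merged = []
    for b in range(num_bins):
        mask = bins == b
        if mask.any():
            merged.append(tokens[mask].mean(dim=0))  # average-merge within a bin
    return torch.stack(merged)                       # (<= num_bins, D)

# Usage: shrink 256 patch tokens to at most 8 depth-merged tokens before the
# (frozen) VLA backbone processes them -- no retraining of the model itself.
tokens = torch.randn(256, 1024)
depth = torch.rand(256)
print(merge_tokens_by_depth(tokens, depth).shape)
```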
Video reasoning models can suffer up to a 35% drop in accuracy and a 28% drop in reasoning quality under real-world perturbations, but a new training framework, ROVA, mitigates this by adaptively prioritizing informative samples.
Forget paired video-music training data: V2M-Zero aligns video and music by matching the *timing* of changes within each modality, not the content itself.
VLA-controlled robots can now detect anomalies in under 100ms using a plug-and-play module, enabling real-time recovery from unexpected situations.
Automating museum video metadata curation is now possible with a locally deployable video language model, unlocking previously inaccessible audiovisual archives.
Autonomous driving's next leap hinges on reasoning, not just perception, but current LLM-based approaches are too slow for real-time control.
Geospatial context is a surprisingly effective prior for audio tagging, especially when sounds are acoustically similar, leading to improved performance over audio-only methods.
LVLMs can now provide depth-aware pedestrian navigation guidance by grounding language reasoning and segmentation, without needing user-provided cues or anchor points.
Explicitly aligning audio and video streams in a multimodal Transformer boosts emotion recognition, showing that ignoring frame-rate differences hurts performance.
Human-preference aligned audio generation from video is now possible, as V2A-DPO surpasses previous methods by directly optimizing for semantic consistency, temporal alignment, and perceptual quality.
Forget catastrophic forgetting: this imitation learning framework remembers up to 65% more while improving AUC by 10-17 points on the LIBERO benchmark.
Achieve robust humanoid task execution in complex environments by turning high-level language instructions into verifiable, geometrically-grounded task programs that can recover from failures.
Speech tokenizers, despite being crucial for multimodal LLMs, primarily capture phonetic information, creating a semantic mismatch with text-derived semantics that hinders performance.
This new OCR model beats Gemini-3.1-Pro and Qwen3-VL-235B on key information extraction, thanks to its clever "Layout-as-Thought" process that recovers layout grounding in end-to-end OCR.
Ditch discrete visual tokens: UniCom achieves SOTA multimodal generation by compressing continuous semantic representations, unlocking better controllability and consistency in image editing.
Achieve 2.5x higher success in UAV navigation by decoupling target generation from progress monitoring, enabling safer and more efficient zero-shot flight.
A compact 0.9B multimodal model, GLM-OCR, achieves state-of-the-art document understanding by predicting multiple tokens at once, boosting decoding throughput without blowing up memory.
Forget fine-tuning: surprisingly, single neuron activations in VLMs can be directly probed to create classifiers that outperform the full model, with 5x speedups.
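As a rough illustration of what probing a single neuron can look like, the sketch below thresholds one recorded activation to build a binary classifier; the function name and the threshold search are assumptions made for the example, not the paper's procedure.

```python
import numpy as np

def neuron_probe_classifier(activations, labels):
    """Hypothetical single-neuron probe (illustrative, not the paper's method).

    activations: (N,) activation of ONE hidden unit, recorded while the frozen
                 VLM processes N labeled images.
    labels:      (N,) binary labels.
    Returns a threshold and its accuracy; sign(activation - threshold) predicts the label.
    """
    best_thr, best_acc = 0.0, 0.0
    for thr in np.unique(activations):
        # Accept either polarity of the neuron (high or low activation = positive class).
        acc = max(((activations > thr) == labels).mean(),
                  ((activations <= thr) == labels).mean())
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr, best_acc

# Usage: reading one unit is far cheaper than decoding through the language head,
# which is where the reported speedups would come from.
acts = np.random.randn(200) + np.repeat([0.0, 1.5], 100)
labs = np.repeat([0, 1], 100).astype(bool)
print(neuron_probe_classifier(acts, labs))
```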
Generative AI's ability to reason about and refine images based on authenticity criteria inadvertently creates a powerful evasion strategy that renders current deepfake detectors ineffective.
A training-free visual distillation method boosts VLA model performance in cluttered environments by over 34%, proving that targeted noise reduction is more effective than brute-force scaling.
Imagine an XR experience where you can selectively isolate and enhance individual sound sources in real-time, making chaotic audio environments crystal clear.
By decoupling visual and motor information during pretraining, FutureVLA unlocks more effective visuomotor prediction for vision-language-action models, boosting performance without modifying downstream architectures.
By jointly modeling video dynamics and actions, DiT4DiT achieves 10x sample efficiency and 7x faster convergence in robot policy learning, showing that video generation can be a powerful scaling proxy.
Forget scaling reasoning – this work shows that scaling visual perception using code-grounded data is the real key to unlocking MLLMs' STEM abilities.
Vision-language models can significantly enhance language models through knowledge distillation, even without direct textual understanding, challenging conventional KD paradigms.
Multimodal LLMs still struggle to faithfully recreate webpages from videos, particularly in capturing fine-grained style and motion, despite advances in other areas.
Autonomous vehicles can now better "see" the world even when cameras fail, thanks to a new method that fills in the blanks by leveraging spatial overlaps and learned semantic priors.
Skip expensive manual annotation: this method extracts accurate 3D UAV trajectories and classifications directly from readily available internet videos.
Generate realistic and controllable videos of humans interacting with objects using only sparse motion cues, like wrist positions and object bounding boxes.
Current Large Audio Language Models (LALMs) struggle with basic audio understanding tasks like noise localization and cross-lingual speech, with some performing worse than random chance, despite excelling at speech recognition.
By converting point clouds into a format VLMs can understand, VLM-Loc significantly boosts text-to-point-cloud localization accuracy, outperforming prior methods that rely on shallower text-point cloud correspondences.
Sports expose surprising limitations in VLMs' spatial reasoning, as current models struggle to generalize from existing benchmarks despite fine-tuning gains on a new, large-scale dataset.
A 4B-parameter model, InternVL-U, outperforms 14B-parameter models in multimodal generation and editing, proving that size isn't everything.
Even the most advanced MLLMs like GPT-5 and Gemini struggle to spot the "odd one out" in simple visual grids, revealing a surprising weakness in fine-grained visual perception.
Forget manual labeling: STONE offers a massive, automatically-labeled dataset for off-road robot navigation, unlocking scalable training for robust 3D traversability prediction.
Ditch the map: a diffusion model learns to plan UAV swarm trajectories directly from RGB images, enabling reactive and adaptive navigation in cluttered environments.
Human-in-the-loop learning can now boost dexterous manipulation VLA models by 25%, thanks to a new framework that smartly samples corrective actions and enables real-time intervention.
Explicitly teaching LVLMs to reason step-by-step with reinforcement learning unlocks state-of-the-art performance on multimodal object-entity relation extraction.
Achieve SOTA multi-modal object tracking by adaptively fusing modalities with a Mixture of Experts and decoupling temporal propagation with separate State Space Models.
By explicitly bridging the gap between on-body appearances and flat layouts, BridgeDiff achieves state-of-the-art virtual try-off results, generating more realistic and structurally sound flat-garment representations.
Unlock real-time semantic SLAM and multimodal interaction with 3D Gaussian Splatting using X-GS, a unified and extensible open framework.
Steer clear of catastrophic forgetting in VLMs with EvoPrompt, a new method that evolves prompts by preserving learned semantic directions while adapting their magnitude.
Large models are emerging as a promising new paradigm for translating complex-layout document images, as shown by the ICDAR 2025 DIMT competition.
LVLMs can be jailbroken by "Reasoning-Oriented Programming," which chains together harmless visual inputs to trigger harmful reasoning, much like return-oriented programming in traditional security exploits.
By explicitly modeling how abnormalities relate within and across different medical image views, GIIM achieves significantly higher diagnostic accuracy and robustness, even with incomplete data.
Skip the expensive proxy model training: this training-free method boosts VLLM performance by up to 4.8% using only 10-15% of the data, simply by measuring how much the question *changes* the model's view of the answer.
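One plausible way to operationalize "how much the question changes the model's view of the answer" is a KL divergence between the answer distributions with and without the question; the sketch below assumes that framing and is not the paper's exact metric.

```python
import torch
import torch.nn.functional as F

def question_influence_score(logits_with_q, logits_without_q):
    """Hypothetical data-scoring sketch (illustrative only).

    Measures how much the question shifts the answer distribution:
    KL( p(answer | image, question) || p(answer | image) ).
    Samples with a low score add little and could be dropped, keeping
    only the most informative fraction of the data.
    """
    log_p_with = F.log_softmax(logits_with_q, dim=-1)
    log_p_without = F.log_softmax(logits_without_q, dim=-1)
    # torch's kl_div(input, target) computes KL(target || input).
    return F.kl_div(log_p_without, log_p_with, log_target=True, reduction="batchmean")
```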
LALMs struggle to handle multiple concurrent audio inputs, but a simple input permutation strategy can significantly boost their multi-audio understanding without retraining.
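A minimal sketch of what an input permutation strategy could look like at inference time: run the model over several orderings of the clips and majority-vote the answers. The wrapper below is illustrative; `answer_fn` stands in for whatever LALM inference call you already have, so no retraining is involved.

```python
import itertools
from collections import Counter

def permutation_vote(audio_clips, question, answer_fn, max_perms=6):
    """Hypothetical permutation-and-vote wrapper (illustrative sketch only).

    audio_clips: list of audio inputs the LALM receives in sequence
    answer_fn:   callable (clips, question) -> answer string
    Runs the model on several clip orderings and majority-votes the answers,
    reducing sensitivity to the order in which concurrent audios are presented.
    """
    perms = list(itertools.permutations(audio_clips))[:max_perms]  # cap factorial growth
    answers = [answer_fn(list(p), question) for p in perms]
    return Counter(answers).most_common(1)[0][0]
```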
Controllable emotion style transfer in speech is now possible without needing paired data, opening new avenues for data augmentation and expressive AI.
Forget retraining: Ego personalizes VLMs on the fly by extracting and leveraging visual tokens that represent specific concepts using the model's internal attention.
A 4B-parameter model outperforms Gemini-3-Pro in autonomous driving by incorporating physics-informed constraints and style-aware training, suggesting specialized models can surpass larger, general-purpose models in domain-specific tasks.
VLMs still struggle to understand our planet, as revealed by a new geospatial benchmark spanning diverse Earth observation tasks and multi-source sensing data.
Forget blurry sketch-to-image outputs: this method uses component-aware self-attention and coordinate-preserving fusion to generate photorealistic images with unprecedented fidelity and spatial accuracy.
Finally, a GelSight-style sensor that doesn't force you to choose between pre-contact vision and high-fidelity tactile sensing.
Ditch the flat scene graphs: TopoOR models surgical environments as higher-order topological structures, unlocking superior performance in safety-critical tasks by preserving complex relationships and multimodal data.
Precisely steer text-to-image generation along cognitive dimensions like valence and memorability with CogBlender, a framework that lets you dial in psychological intent.
Event cameras can now estimate depth with significantly improved temporal consistency and accuracy thanks to a novel distillation approach from video foundation models, achieving a 53% reduction in depth error.
Zero-shot robotic manipulation is now within reach: TiPToP matches a 350-hour fine-tuned model without *any* robot data.
Unlock the power of web videos for embodied AI: implicit geometry representations let agents learn to navigate from real-world room tours without relying on fragile 3D reconstruction.
By representing visual inputs as 3D Gaussian primitives, GST-VLA unlocks a new level of geometric understanding for vision-language-action models, leading to substantial performance gains in robotic manipulation tasks.
Unlock realistic acoustic simulations with a text prompt: fine-tuning a text-to-audio model generates plausible room impulse responses, even with limited paired data.
Reverse image search, a key tool for visual fact-checking, often amplifies misinformation and irrelevant content, burying debunking information.
Domain-specific biosignal foundation models, fused with multimodal ECG and PPG data, substantially outperform general time-series models on clinically relevant tasks, but bigger isn't always better.
Imagine writing a script and instantly seeing it come to life – Doki makes generative video authoring as intuitive as writing a text document.
A new large-scale dataset could jumpstart Vietnamese VQA research by providing a crucial resource for training and evaluating multimodal models in a low-resource language.
MLLMs still struggle to reliably predict the long-term consequences of actions in egocentric videos, even with structured scene annotations.
VLMs can now self-evolve from *zero* data, thanks to a multi-agent RL framework that synthesizes its own visual concepts and reasoning tasks.
A robot can now achieve 90% success in peg-in-hole tasks, even with only 0.1mm clearance, by intelligently fusing vision and tactile feedback when visual occlusion occurs.
Even GPT-5 struggles with multi-modal robustness and turn overhead when user personas and multi-modal inputs are considered in agent evaluation, revealing critical gaps in current LLM agent capabilities.
Combining pre-trained and custom neural networks with data augmentation and transfer learning yields a robust autonomous driving system capable of accurately perceiving and reacting to its environment.
Finally, a single model that can generate both your face and voice, convincingly controlled by text prompts and reference clips.
Provably secure steganography can now withstand real-world image compression and processing thanks to a clever latent-space optimization technique.
Medical multi-agent systems can reason deeply, but fall apart when switching between medical specialties, highlighting a critical need for more robust architectures.
LLMs can drive pedagogical agents to be more engaging and effective by dynamically generating speech and gestures that align with the semantic context of instructional content.
Panoramic vision-language models can achieve a level of holistic scene understanding and robustness in adverse conditions that's impossible for traditional pinhole-based VLMs.
Robots can now recover from failures during manipulation tasks by explicitly tracking progress against spatial subgoals, without needing extra training data or models.
Adapt your action anticipation model on-the-fly to new viewpoints (egocentric or exocentric) with a novel test-time adaptation method that leverages multi-label prototype growing and dual-clue consistency.
Ditch global embeddings for text-motion retrieval: this method uses joint-angle motion images and token-patch late interaction to achieve state-of-the-art accuracy and interpretability.
Generate more realistic and nuanced human movements from text by explicitly modeling individual body parts, overcoming the limitations of existing holistic approaches.
Skip the costly policy training: this zero-shot method nails text-goal instance navigation by grounding language in 3D geometry for smarter exploration and verification.
Current AI models fall short when asked to understand a situation from the combined perspectives of multiple embodied agents, as revealed by a new challenging benchmark.
FetalAgents leapfrogs existing fetal ultrasound analysis tools by dynamically orchestrating specialized AI agents, outperforming monolithic models across diverse clinical tasks and delivering structured clinical reports from video streams.
Multimodal models that seem robust can still fail when some modalities are systematically missing, a problem MissBench exposes with new metrics for modality equity and learning balance.
By fusing confidence-weighted point cloud projections with a Kalman-inspired update mechanism, ConfCtrl enables diffusion models to generate geometrically consistent novel views from sparse inputs, even under significant viewpoint shifts.
By translating visual observations into language, LAP achieves state-of-the-art procedure planning by disambiguating visually similar actions, outperforming vision-only methods.
By injecting symbolic reasoning into vision-language-action models, NS-VLA achieves remarkable gains in data efficiency and generalization for robotic manipulation.
By learning visual representations from scene-level semantics down to pixel-level details, C2FMAE overcomes the limitations of both contrastive learning and masked image modeling.
A single spatial token, learned via occupancy prediction on a massive dataset, is surprisingly effective at injecting crucial spatial awareness into vision-language navigation, leading to state-of-the-art performance.
MLLMs struggle with visually rendered text not because they can't reason, but because they can't *read* it, and a simple self-distillation fix closes the gap.
By having a single VLM critique its own SVG renderings, IntroSVG learns to generate more complex, semantically aligned, and editable vector graphics from text prompts.
Forget training separate models for different field-of-views in geo-localization — SinGeo achieves SOTA robustness with a single model, even outperforming specialized architectures.
Stop letting sparse rewards bottleneck your VLN agent: SACA disentangles failed trajectories into valid prefixes and divergence points for dense supervision, unlocking SOTA performance.
Unlock scalable, privacy-sensitive image steganography with MIDAS, a training-free diffusion framework that grants user-specific access control to hidden multi-image content.
Even with 80% of brain scan data missing, ACADiff can accurately generate the missing modalities and maintain robust diagnostic performance for Alzheimer's disease.
Pathology MLLMs can now better incorporate diagnostic standards during reasoning, thanks to a new memory architecture inspired by how human pathologists process information.
Transform unstructured audio-visual signals into machine-readable structured knowledge with the Logics-Parsing-Omni model, which enforces strict alignment between high-level semantics and low-level facts.
Text-only foundation models can perform surprisingly well on complex 3D spatial reasoning tasks, rivaling multimodal models, when equipped with a structured spatial representation derived from 3D reconstruction.
Ditch slow, iterative ODE solvers for robot control: this method distills flow-based policies into a single-step model that's fast enough for real-time replanning without sacrificing multi-modal action diversity.