Current multimodal dialogue models struggle to capture the nuanced expressiveness of human interaction, but a new dataset and benchmark reveal exactly where they fall short.
StreamingVLA achieves a remarkable 2.4x speedup and 6.5x reduction in execution halting by asynchronously parallelizing observation, action generation, and execution stages in vision-language-action models.
Instruction-guided video editing can achieve impressive zero-shot performance simply by pre-training on motion-centric video restoration tasks *before* fine-tuning on paired editing data.
MLLMs can ace the test, but still fail to *see*—they often succeed at complex reasoning with symbols while failing at basic symbol recognition, revealing a reliance on linguistic priors over true visual perception.
AI can now handle the tedious copywriting and real-time Q&A for live-streaming commerce, freeing up human streamers to focus on engagement.
Embodied navigation agents, already struggling, fall apart when faced with the kinds of messy, real-world sensor and instruction corruptions that NavTrust now exposes.
By iteratively reasoning over video snippets with a Chain-of-Thought, $\text{R}^2$VLM achieves state-of-the-art long-horizon task progress estimation without needing to process entire videos at once.
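As a rough, hypothetical illustration of iterating over snippets rather than whole videos: keep a running chain-of-thought summary, feed each new snippet plus the prior reasoning back to the model, and read out an updated progress estimate. The `vlm.reason` interface and prompt below are assumptions for the sketch, not the paper's API.

```python
def estimate_progress(vlm, video_snippets, task_description):
    """Toy loop: reason over one snippet at a time instead of the whole video,
    carrying a chain-of-thought summary forward between steps. Illustrative only."""
    reasoning = ""      # running chain-of-thought carried across snippets
    progress = 0.0
    for snippet in video_snippets:
        prompt = (f"Task: {task_description}\n"
                  f"Previous reasoning: {reasoning}\n"
                  f"Given the new clip, update the reasoning and estimate progress (0-1).")
        # Hypothetical interface: returns updated reasoning text and a scalar progress value.
        reasoning, progress = vlm.reason(frames=snippet, prompt=prompt)
    return progress
```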
Forget real-world video datasets: training VLMs on just 7.7K synthetic videos with temporal primitives beats 165K real-world examples, unlocking surprisingly effective transfer learning for video reasoning.
Ditch the diffusion vs. autoregressive debate: this VLA framework uses diffusion to *draft* actions and an autoregressive model to *verify* them, boosting real-world success by nearly 20%.
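A minimal sketch of what a draft-then-verify loop of this kind could look like, assuming a diffusion policy that proposes several candidate action chunks and an autoregressive model that scores them by likelihood; `diffusion_policy`, `ar_model`, and their methods are illustrative assumptions, not the paper's interface.

```python
import numpy as np

def draft_and_verify(obs, diffusion_policy, ar_model, num_drafts=8):
    """Draft candidate action chunks with a diffusion policy, then let an
    autoregressive model pick the most plausible one. Illustrative only."""
    # 1. Draft: sample several candidate action sequences from the diffusion policy.
    drafts = [diffusion_policy.sample(obs) for _ in range(num_drafts)]
    # 2. Verify: score each draft by the autoregressive model's log-likelihood
    #    of the actions given the observation, and keep the best-scoring one.
    scores = [ar_model.log_prob(obs, actions) for actions in drafts]
    return drafts[int(np.argmax(scores))]
```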
Directly modeling 3D geometry in dental scans unlocks a 9.58% accuracy boost in multi-disease diagnosis compared to methods relying on 2D or multi-view image representations.
By aligning image and LiDAR features to event-derived spatiotemporal edges, $x^2$-Fusion achieves state-of-the-art accuracy in optical and scene flow estimation, particularly under challenging conditions where other multimodal fusion methods falter.
DriveFix tackles the "shaky camera" problem in 4D driving scene reconstruction, producing significantly more stable and coherent novel views by explicitly modeling spatio-temporal dependencies.
Multi-hop data synthesis using HopChain boosts VLM performance across a wide range of tasks, with gains of over 50 points in accuracy for ultra-long-context reasoning.
Injecting physics-based priors derived from MLLMs at decoding time significantly boosts weather forecasting accuracy and stability, even in long autoregressive rollouts.
MLLMs still can't handle time-sensitive multimodal reasoning, often failing to integrate auditory and visual cues effectively in dynamic environments like a 4D escape room.
A 2B parameter model trained on a new 1.1M dataset can now forecast remote sensing scenes better than Gemini-2.5-Flash Image, suggesting that task-specific training data and methods can beat sheer scale.
Autonomous driving models can learn to avoid accidents *before* they happen by training on expert interventions and anticipating errors.
Forget fine-tuning rare tokens: MoKus leverages cross-modal knowledge transfer to bind diverse textual knowledge to visual concepts, achieving high-fidelity customized generation.
By decoupling patch details from semantics, Cheers achieves state-of-the-art multimodal performance at 20% of the training cost of comparable models.
Current embodied AI agents falter when faced with the multi-floor complexity of environments generated by MANSION, a new language-driven framework for creating realistic, building-scale 3D scenes.
Achieve 92% accuracy in identifying who's commanding a robot from 34 meters away by fusing IMU and camera data, a 48% leap over prior art.
Floor plan generation gets a major upgrade with HouseMind, a multimodal LLM that uses discrete room-instance tokens to achieve unprecedented geometric validity and controllability.
Control both multi-subject identity and multi-granularity motion in video generation with DreamVideo-Omni, a framework that uses latent identity reinforcement learning to avoid identity degradation.
A compact 0.9B multimodal model, GLM-OCR, achieves state-of-the-art document understanding by predicting multiple tokens at once, boosting decoding throughput without blowing up memory.
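For context, multi-token prediction generally means the decoder emits k tokens per forward pass instead of one, cutting the number of sequential model calls. The toy decoding loop below is a hypothetical sketch of that idea; the `predict_next_k` interface is an assumption, not GLM-OCR's API.

```python
def multi_token_decode(model, prompt_ids, k=4, max_len=256, eos_id=2):
    """Toy greedy decoding loop where each forward pass predicts k tokens at once,
    reducing sequential model calls by roughly a factor of k. Illustrative only."""
    ids = list(prompt_ids)
    while len(ids) < max_len:
        # Hypothetical interface: one forward pass returns k greedy next tokens.
        next_tokens = model.predict_next_k(ids, k=k)
        for tok in next_tokens:
            ids.append(tok)
            if tok == eos_id or len(ids) >= max_len:
                return ids
    return ids
```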
Forget scaling reasoning – this work shows that scaling visual perception using code-grounded data is the real key to unlocking MLLMs' STEM abilities.
By learning visual representations from scene-level semantics down to pixel-level details, C2FMAE overcomes the limitations of both contrastive learning and masked image modeling.
Forget training separate models for different field-of-views in geo-localization — SinGeo achieves SOTA robustness with a single model, even outperforming specialized architectures.
Pathology MLLMs can now better incorporate diagnostic standards during reasoning, thanks to a new memory architecture inspired by how human pathologists process information.
MLLMs can now reliably interpret electromagnetic signals even in noisy environments, thanks to a new training framework and benchmark designed specifically for this challenging domain.
LLMs can significantly boost micro-expression recognition by reasoning about subtle facial muscle movements when guided by structured visual and relational prompts.
Stop predicting the future, start predicting *change*: $Δ$VLA guides robotic action by modeling how world knowledge *varies* under actions, not by forecasting absolute future states.
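As a rough illustration of the difference, predicting change rather than the absolute future state can be as simple as learning a residual over the current latent under an action; the sketch below is generic and hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DeltaWorldModel(nn.Module):
    """Toy residual world model: predict how the latent state *changes* under an
    action instead of regressing the absolute next state. Illustrative only."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.delta_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        delta = self.delta_net(torch.cat([state, action], dim=-1))  # predicted change
        return state + delta  # next state = current state plus the predicted change
```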
Text-to-image customization can now preserve the original model's behavior, thanks to a decoupled learning objective that balances new concepts with pre-existing capabilities.
Forget task-specific fine-tuning: TSEmbed unlocks SOTA multimodal embeddings by disentangling task objectives with a Mixture-of-Experts and a novel expert-aware negative sampling strategy.
Aura unlocks more accurate aviation time series forecasting by explicitly modeling how different types of external factors interact with temporal dynamics.
Finally, AI can generate hour-long videos with consistent characters and backgrounds, thanks to a new framework that nails seamless transitions between shots.
LLMs can achieve state-of-the-art audio-visual speech recognition by sparsely aligning modalities and refining with visual unit guidance, substantially boosting robustness in noisy environments.
Achieve state-of-the-art multimodal intent recognition by structuring semantics into progressively abstracted levels and dynamically refining representations through MLLM feedback.
By explicitly disentangling degradation and semantic features with wavelet attention, CWP-Net achieves superior all-in-one image restoration, outperforming previous methods hampered by spurious correlations and biased degradation estimation.
Ditch hard clipping: GIPO's Gaussian-weighted importance sampling offers a smoother, more stable RL policy optimization, especially when dealing with stale or limited data.
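The headline names the mechanism but not the math; a minimal, hypothetical sketch of the general idea follows, assuming a PPO-style objective where the hard-clipped importance ratio is replaced by a smooth Gaussian weight centered at 1 (the loss form and the `sigma` hyperparameter are illustrative assumptions, not the paper's).

```python
import torch

def gaussian_weighted_policy_loss(logp_new, logp_old, advantages, sigma=0.5):
    """Illustrative alternative to hard clipping: instead of clamping the importance
    ratio to [1-eps, 1+eps], down-weight samples smoothly as the ratio drifts away
    from 1 using a Gaussian kernel. Hypothetical sketch only."""
    ratio = torch.exp(logp_new - logp_old)                          # importance sampling ratio
    weight = torch.exp(-((ratio - 1.0) ** 2) / (2 * sigma ** 2))    # Gaussian weight, peaks at ratio = 1
    # Detach the weight so it rescales the gradient rather than being optimized itself.
    return -(weight.detach() * ratio * advantages).mean()
```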
Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
By predicting latent features instead of pixels, PROSPECT achieves state-of-the-art VLN performance and long-horizon robustness without adding inference overhead.
Multimodal models are often blind at birth: a new "Visual Attention Score" reveals they struggle to focus on visual inputs during cold-start, but a simple attention-guided fix can boost performance by 7%.
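The summary doesn't define the score; one plausible (assumed) formulation is simply the share of attention mass that lands on visual tokens, as in this hypothetical sketch.

```python
import torch

def visual_attention_score(attn, visual_mask):
    """Hypothetical 'visual attention score': fraction of attention mass placed on
    visual tokens. attn: [heads, query_len, key_len] attention probabilities;
    visual_mask: [key_len] bool tensor, True where the key is an image token."""
    mass_on_visual = attn[..., visual_mask].sum(dim=-1)  # attention going to image tokens
    return mass_on_visual.mean().item()                  # averaged over heads and queries
```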
Achieve state-of-the-art semantic scene understanding from sparse views with a feed-forward architecture that generalizes across diverse environments.
Achieve state-of-the-art image fusion and restoration in complex adverse weather by unifying infrared-visible fusion with compound degradation removal in a single Mamba-based model.
By disentangling structure and motion in the latent space, CoWVLA achieves superior visuomotor learning compared to standard world-model and latent-action approaches.
Multimodal jailbreaks, meet your match: SaFeR-ToolKit's virtual tool-calling protocol boosts VL model safety by up to 55% without sacrificing general capabilities.
AI-powered pathology slashes GTD diagnosis time by 71% while boosting accuracy, offering a lifeline for maternal health.
Achieve state-of-the-art monocular re-localization in OpenStreetMap by cleverly aligning image semantics with map data, enabling faster and more accurate localization than dense matching approaches.
Achieve real-time, drift-free online 3D reconstruction by decoupling memory into actively refreshed local geometry and a stable, persistent global structure.
Achieve 100% success rates in visually ambiguous manipulation tasks by fusing high-frequency tactile data with low-frequency visual planning, outperforming visual-only baselines and satisfying hard real-time constraints.