100 papers published across 7 labs.
Aligning diffusion models with just 100 carefully selected samples can beat state-of-the-art preference optimization methods trained on thousands, and converge up to 220x faster.
Giving medical imaging AIs the same tools as human doctors actually *hurts* their performance, revealing a surprising lack of spatial reasoning.
Forget redrawing diagrams by hand: VFIG, a new vision-language model, can automatically convert rasterized figures into editable SVGs with near GPT-5.2 quality.
Training data is not enough: reasoning traces from diverse cultural backgrounds are critical for safe and reliable autonomous driving in rare, long-tail scenarios.
Forget random back-view hallucinations – Know3D lets you *prompt* the unseen side of 3D models using language, opening the door to controllable 3D asset creation.
Forget generating plausible-but-fake details: 3DreamBooth bakes a robust 3D prior into video generation models using only a single-frame optimization, enabling truly view-consistent customized subject videos.
Skip reinforcement learning and still get SOTA vision-language reasoning performance with a simple loss re-weighting scheme that cuts training time by 7x.
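The teaser doesn't spell out the scheme, but a minimal sketch of the general idea, per-sample loss re-weighting standing in for an RL preference objective, might look like this (the softmax weighting and all names are assumptions, not the paper's method):

```python
import torch
import torch.nn.functional as F

def reweighted_ce_loss(logits: torch.Tensor, targets: torch.Tensor,
                       rewards: torch.Tensor) -> torch.Tensor:
    """Cross-entropy where each sample's contribution is scaled by a
    reward signal, approximating a preference objective without an RL
    loop. Hypothetical: the paper's actual weighting rule isn't given.
    """
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    # Higher-reward samples get proportionally larger gradient weight;
    # rescaling keeps the mean weight at 1 so the LR stays comparable.
    weights = torch.softmax(rewards, dim=0) * rewards.numel()
    return (weights * per_sample).mean()
```

Because the weights enter a plain supervised loss, training stays a single forward/backward pass per batch, which is presumably where the claimed speedup over RL comes from.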
A million-sequence, high-quality, open-source motion dataset finally lets text-to-motion models generalize beyond toy benchmarks.
Fine-tuning a visual geometry transformer with SEAR unlocks surprisingly accurate RGB-Thermal 3D reconstruction, even surpassing SOTA methods despite training on significantly less multimodal data.
Closed-loop feedback using VLMs can dramatically improve text-to-image generation quality, even without additional training.
Text-only pre-training quietly endows different LLMs with surprisingly different levels of auditory knowledge, directly impacting their effectiveness as backbones for audio language models.
Current VLMs struggle with multi-hop spatial reasoning, often failing to compose even simple spatial relations across multiple steps, highlighting a critical gap for real-world VLA agent deployment.
Strategic visual aids are the secret weapon for geometric reasoning, and this work shows how to teach MLLMs to wield them effectively via reinforcement learning.
Skip annotating image rationales: this method transfers text-based rationales to images for explainable crisis classification, saving annotation effort while boosting performance.
Ditch the syntax-only grind: a multi-modal assessment strategy proves that introductory programming courses can boost both coding skills and crucial soft skills like communication and critical thinking.
Synthesizing realistic room acoustics from a single recording is now possible, thanks to a novel flow-matching approach that captures the uncertainty inherent in acoustic environments.
Forget waiting a minute for garment generation: SwiftTailor slashes inference times while boosting accuracy by representing 3D garments as geometry images.
Generative videos might look great, but a new metric reveals they often suffer from jarring 3D spatial inconsistencies that existing metrics miss.
Forget brute-force scaling: intelligently selecting just 1% of video frames can actually *improve* video QA accuracy and cut compute by 93%.
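As a rough illustration of query-conditioned frame selection (cosine top-k is an assumption; the paper's selector is presumably more sophisticated):

```python
import numpy as np

def select_frames(frame_feats: np.ndarray, query_feat: np.ndarray,
                  keep_ratio: float = 0.01) -> np.ndarray:
    """Return indices of the frames most relevant to the question.

    frame_feats: (num_frames, dim), L2-normalized per frame
    query_feat:  (dim,), L2-normalized question embedding
    """
    k = max(1, int(round(len(frame_feats) * keep_ratio)))
    scores = frame_feats @ query_feat        # cosine similarities
    top_k = np.argsort(scores)[-k:]          # best-matching frames
    return np.sort(top_k)                    # restore temporal order
```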
Fine-tuning LVLMs on counting alone boosts general visual reasoning by over 1.5%, revealing counting as a surprisingly central skill.
VLAs aren't just memorizing training data; sparse autoencoders reveal a hidden layer of generalizable motion primitives that can be steered to control robot behavior across tasks.
Multimodal LLMs suffer a major performance hit when asked to switch from text-based to image-based tasks mid-conversation, revealing a surprising asymmetry in their ability to handle task interference.
Instruction-guided video editing can achieve impressive zero-shot performance simply by pre-training on motion-centric video restoration tasks *before* fine-tuning on paired editing data.
Achieve more physically realistic video generation by explicitly modeling 3D geometry and physical attributes across multiple viewpoints.
VLMs can now better detect when they're seeing something they shouldn't, even as the world changes around them, thanks to a new method that dynamically fuses visual and textual cues.
VLMs selectively ignore visual information based on question framing, even when the visual reasoning task remains identical, highlighting a critical vulnerability in their grounding capabilities.
MLLMs can ace the test, but still fail to *see*—they often succeed at complex reasoning with symbols while failing at basic symbol recognition, revealing a reliance on linguistic priors over true visual perception.
Get faithful and plausible natural language explanations for chest X-rays with as few as 5 human-annotated examples per diagnosis, and even boost classification accuracy in the process.
Current OmniLLMs stumble when processing real-world, long-form audio-visual content, achieving only ~35-65% accuracy on a new benchmark designed to test long-term memory and fine-grained understanding.
VLMs struggle with spatial reasoning, but a clever decomposition into sub-problems and probabilistic recombination unlocks significantly better metric-semantic grounding.
Unlocking fairer vision-language models may be as simple as intervening in the latent space of a sparse autoencoder, enabling targeted bias removal without harming performance.
Forget generic textures – CustomTex lets you clone real-world object appearances onto your 3D scenes with uncanny fidelity.
Proactive VideoLLMs can finally be both accurate AND efficient thanks to a novel propose-match framework that decouples semantic understanding from streaming perception.
LALMs still struggle to get the joke, with a new benchmark showing they can't reliably recognize, locate, or understand audio puns.
A new dataset and model specifically designed for traffic anomaly understanding in roundabouts could pave the way for more robust and efficient intelligent transportation systems.
Achieve state-of-the-art panoramic depth estimation without any training by cleverly exploiting the 3D consistency priors embedded within existing vision foundation models.
AI can now handle the tedious copywriting and real-time Q&A for live-streaming commerce, freeing up human streamers to focus on engagement.
Turns out, VLA models are mostly just looking at the scene: visual pathways dominate action generation, and language only matters when the visuals are ambiguous.
Automating motor insurance from vehicle damage analysis to claims evaluation is now possible with a vertically integrated AI paradigm.
DriveTok achieves unified multi-view reconstruction and understanding by learning scene tokens that integrate semantic, geometric, and textural information, outperforming existing 2D tokenizers in autonomous driving scenarios.
State Space Models can outperform Vision Transformers as vision encoders in VLMs, particularly when model size is a constraint.
Diffusion models can now generate rare concepts and execute complex edits with greater fidelity, thanks to a training-free prompt blending technique that leverages statistical properties of the diffusion process itself.
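One plausible reading of training-free prompt blending, sketched below with a linear schedule (the paper derives its blending from statistical properties of the diffusion process, so treat this purely as an illustration):

```python
import torch

def blended_prompt(common_emb: torch.Tensor, rare_emb: torch.Tensor,
                   step: int, num_steps: int) -> torch.Tensor:
    """Interpolate text-conditioning embeddings across denoising steps,
    starting from a common, well-learned concept and drifting toward
    the rare target. The linear schedule is an assumption.
    """
    alpha = step / max(1, num_steps - 1)   # 0 at the first step, 1 at the last
    return torch.lerp(common_emb, rare_emb, alpha)
```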
Ditch the finetuning: this training-free method uses attention scores to generate rare concepts in images with more precision and control than LLM-guided approaches.
Real-time robotic perception just got a major upgrade: OnlinePG achieves open-vocabulary panoptic mapping with 3D Gaussian Splatting, enabling robots to understand and interact with environments in a way that was previously impossible.
Synthesized PET scans from MRI can nearly match the diagnostic accuracy of real PET for Alzheimer's, potentially unlocking wider access to crucial functional insights.
Visual language models can now explicitly reason about object trajectories in videos, thanks to a simple yet effective method that augments training data and uses discrete motion tags.
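A toy version of the discrete-motion-tag idea, with thresholds and vocabulary invented for illustration:

```python
def motion_tag(dx: float, dy: float, eps: float = 5.0) -> str:
    """Map a per-interval object displacement (in pixels) to a discrete
    motion tag that can be spliced into training captions. Thresholds
    and the tag set are illustrative assumptions, not the paper's.
    """
    horiz = "right" if dx > eps else "left" if dx < -eps else ""
    vert = "down" if dy > eps else "up" if dy < -eps else ""
    return "-".join(t for t in (vert, horiz) if t) or "static"
```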
LVLMs can gain a surprising amount of spatial reasoning ability by explicitly generating segmentation and depth tokens before answering questions.
Radiometric disentanglement from a single image becomes tractable by exploiting the shared illumination constraint across multiple objects, enabling stochastic sampling of reflectance, texture, and illumination.
Detecting subtle building changes gets a boost: a new RGB-NIR dataset and network reveal the power of multi-modal fusion for teasing out fine-grained differences.
Ditch the mask decoder: a single segmentation token can unlock competitive image segmentation directly from MLLMs.
Reconstructing realistic hand-object interactions from video just got an order of magnitude faster, thanks to a novel Gaussian Splatting approach that ensures physical consistency.
Automating linguistically-grounded sign language annotation is now possible, unlocking scalable dataset curation previously limited by manual effort.
Pixel-perfect geospatial reasoning is now possible, thanks to a vision-language model that adaptively fuses multi-modal and multi-temporal Earth observation data.
Overcoming occlusion in hand-object pose estimation just got easier: GenHOI leverages hierarchical semantic knowledge and hand priors to achieve state-of-the-art results on challenging benchmarks.
Get GPT-4o-level long-video QA performance with 10x fewer FLOPs by using a hierarchical, training-free frame selector that combines multimodal experts and fuzzy logic.
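To make the fuzzy-logic part concrete, here is a hypothetical combiner over per-expert frame scores (the expert names and the AND/OR structure are assumptions):

```python
def fuzzy_frame_relevance(expert_scores: dict[str, float]) -> float:
    """Combine per-expert relevance scores (each in [0, 1]) with
    standard fuzzy operators: AND = min, OR = max. Reading the rule as
    'visually relevant AND (caption-match OR audio-match)' is an
    illustrative guess, not the paper's exact formula.
    """
    visual = expert_scores["visual"]
    textual = max(expert_scores["caption"], expert_scores["audio"])
    return min(visual, textual)              # fuzzy AND of the two cues
```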
Humanoid robots can now generate more empathetic and instruction-aware gestures thanks to a new diffusion framework conditioned on affective estimation and pedagogical reasoning.
Forget painstakingly designing simulation environments: generative 3D world models let you RL-fine-tune robot VLAs with massive scene diversity, boosting real-world transfer by 3x.
Smaller open-source models can outperform larger proprietary LVLMs on specific authenticity cues in AI-generated video detection, challenging the assumption that scale alone guarantees better performance.
Achieve state-of-the-art joint audio-video generation with fewer resources by fixing key flaws in cross-modal context handling within dual-stream transformers.
Token compression and multi-agent systems are enabling more efficient and interpretable multimodal reasoning in computational pathology, paving the way for trustworthy AI-assisted diagnosis.
Flow-based VLAs can react to environmental changes ten times faster by adaptively prioritizing near-term actions during sampling, unlocking unprecedented real-time responsiveness.
High-dimensional discrete tokens, previously out of reach for generative models, can now be directly generated, unlocking a unified token prediction paradigm for multimodal architectures.
Text-to-3D generation gets a semantic upgrade: DreamPartGen creates 3D objects with parts that not only look right but also understand their relationships and align with textual descriptions.
VLMs' safety judgments are easily manipulated by simple semantic cues, revealing a reliance on superficial associations rather than true visual understanding.
Seemingly efficient VLA models can be surprisingly inefficient when deployed on robots, highlighting the need to move beyond standard metrics like FLOPs and parameters.
SLMs are shockingly vulnerable: combining adversarial audio and text unlocks 1.5x to 10x higher jailbreak rates than attacking either modality alone.
Multi-modal federated learning can be made communication-efficient and robust to outliers by learning a shared latent space, even with heterogeneous client architectures.
Spatial awareness is the secret ingredient to unlocking better visual in-context learning, boosting performance across diverse vision tasks.
MLLMs can gain surprisingly strong 3D spatial reasoning abilities simply by tapping into the latent knowledge already present in video generation models.
Embodied navigation agents, already struggling, fall apart when faced with the kinds of messy, real-world sensor and instruction corruptions that NavTrust now exposes.
Unlock geometry-precise 3D generation by directly conditioning diffusion models on readily available point cloud priors, outperforming existing image- or text-conditioned methods.
Hierarchical memory, inspired by human cognition, beats standard approaches in robotic manipulation tasks requiring both precise tracking and long-term retention.
Robots can now manipulate objects with greater dexterity and adaptability thanks to a new world model that leverages both vision and high-frequency tactile feedback to predict and react to contact dynamics.
Medical vision-language models are surprisingly brittle: clinically plausible image manipulations, like those introduced during routine acquisition and delivery, can drastically degrade their performance.
Tactile sensing closes the sim2real gap for deformable object tracing, enabling a single imitation learning model to achieve impressive generalization across diverse objects.
Even the most advanced VLMs like GPT-4o, GPT-5, and Gemini 2.5 Flash are outperformed in multi-actor human-robot interaction grounding by a system that selectively invokes VLMs based on a lightweight perception pipeline.
Image-conditioned video diffusion models can now be fine-tuned to produce more realistic motion dynamics and long-term temporal coherence via a novel reward-driven approach that avoids common pitfalls like reward hacking.
AdaMuS overcomes the bias towards high-dimensional data in multi-view learning by adaptively pruning redundant parameters and sparsely fusing views, leading to improved performance on dimensionally unbalanced data.
By explicitly reasoning in 3D, VolumeDP leaps ahead of 2D-based imitation learning methods, achieving a remarkable 14.8% improvement on the LIBERO benchmark and robust real-world generalization.
By iteratively reasoning over video snippets with a Chain-of-Thought, R²VLM achieves state-of-the-art long-horizon task progress estimation without needing to process entire videos at once.
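A schematic of the snippet-by-snippet loop, where `vlm` is a hypothetical callable standing in for the model:

```python
def estimate_progress(snippets, vlm, task: str) -> float:
    """Chain-of-thought over a long video, one snippet at a time.

    `vlm` is a hypothetical callable (snippet, prompt) -> (thought, progress);
    only the accumulated reasoning, not raw frames, is carried forward,
    so memory stays bounded regardless of video length.
    """
    reasoning, progress = "", 0.0
    for snippet in snippets:
        prompt = f"Task: {task}\nReasoning so far: {reasoning}"
        thought, progress = vlm(snippet, prompt)
        reasoning += thought + "\n"
    return progress          # estimated task completion in [0, 1]
```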
LLMs can be prompted to generate part-aware instructions that substantially improve open-vocabulary 3D affordance grounding by linking semantically similar affordances and refining geometric differentiation.
Current AI safety filters can't tell a joke from a threat, especially when humor relies on cultural context – this new benchmark exposes that blind spot.
The field of video understanding is rapidly shifting from isolated pipelines to unified models capable of adapting to diverse downstream tasks, demanding a re-evaluation of current approaches.
Unleash creativity in text-to-image models with a single, reusable 64-token template, sidestepping costly iterative prompt engineering and reasoning.
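In PyTorch terms, such a template could be a 64-token soft prompt prepended to the text embeddings (the token count comes from the teaser; the class name and embedding width are assumptions):

```python
import torch
import torch.nn as nn

class PromptTemplate(nn.Module):
    """A reusable bank of 64 learnable embeddings prepended to the text
    encoder's token embeddings; once trained, it replaces per-prompt
    iterative engineering. Hypothetical sketch, not the paper's code.
    """
    def __init__(self, num_tokens: int = 64, dim: int = 768):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, text_embs: torch.Tensor) -> torch.Tensor:
        # text_embs: (batch, seq_len, dim) -> (batch, 64 + seq_len, dim)
        prefix = self.tokens.unsqueeze(0).expand(text_embs.shape[0], -1, -1)
        return torch.cat([prefix, text_embs], dim=1)
```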
Even with a 98:1 test-to-train ratio, PEFT methods like QLoRA can unlock surprisingly strong generalization from billion-parameter vision models for agricultural image classification, suggesting underfitting, not overfitting, is the bigger risk.
SAM3 disappoints in eye image segmentation, failing to surpass SAM2's performance despite its new concept prompting mode.
This model beats clinical reports in quantitative coronary angiography, opening the door to automated, comprehensive assessment of coronary artery disease at the point of care.
Forget verbose instructions: this new VLN paradigm uses floor plans to guide navigation with concise commands, boosting success rates by 60%.
Existing 3D visual grounding methods crumble in complex scenes, but PC-CrossDiff's dual-level attention unlocks a +10% accuracy boost by parsing subtle spatial cues.
Naive fine-tuning of VLMs for multimodal sequential recommendation causes catastrophic modality collapse, but can be fixed with gradient rebalancing and cross-modal regularization.
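A guess at what gradient rebalancing could look like in practice, equalizing per-modality gradient norms after the backward pass (the paper's actual rule may differ):

```python
import torch

@torch.no_grad()
def rebalance_modality_grads(text_params, image_params) -> None:
    """After loss.backward(), rescale image-branch gradients so their
    global norm matches the text branch, preventing one modality from
    dominating updates and collapsing the other. Hypothetical sketch.
    """
    def global_norm(params):
        grads = [p.grad for p in params if p.grad is not None]
        if not grads:
            return torch.tensor(0.0)
        return torch.sqrt(sum((g ** 2).sum() for g in grads))

    g_text, g_image = global_norm(text_params), global_norm(image_params)
    if g_image > 0:
        scale = g_text / g_image
        for p in image_params:
            if p.grad is not None:
                p.grad.mul_(scale)
```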
Achieve state-of-the-art fine-grained visual recognition without training by adaptively invoking reasoning in a Large Vision-Language Model only when needed, significantly reducing computational overhead.
Denoised eye-tracking heatmaps dramatically boost the generalization of iris presentation attack detection, outperforming hand annotations and even self-supervised DINOv2 features.
Pruning vision tokens across both the ViT and LLM can yield a 62% efficiency boost in video VLMs with minimal performance loss, and without complex text conditioning.
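A minimal sketch of attention-based vision-token pruning (the scoring rule and keep ratio are illustrative, not the paper's exact recipe):

```python
import torch

def prune_vision_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor,
                        keep_ratio: float = 0.38) -> torch.Tensor:
    """Keep only the vision tokens the [CLS] token attends to most.

    tokens:   (batch, num_tokens, dim) patch features
    cls_attn: (batch, num_tokens) attention from [CLS] to each patch
    keep_ratio is illustrative; the attention-based scoring is an
    assumption about how such pruning is typically done.
    """
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices.sort(dim=1).values  # keep order
    batch_idx = torch.arange(tokens.shape[0], device=tokens.device).unsqueeze(1)
    return tokens[batch_idx, idx]            # (batch, k, dim)
```

The same gather can run once after the ViT and again inside the LLM's early layers, which matches the teaser's claim that pruning in both stages compounds the savings.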
Forget fixed layer counts: LaDe generates fully editable, layered media designs with a *flexible* number of semantically meaningful layers, outperforming existing methods in text-to-layer alignment.
Robots can now nimbly navigate complex, multi-floor environments without prior training, thanks to a new strategy that dynamically switches between exploration, recovery, and memory recall.
Current LMMs can't reliably turn complex images into code, failing to preserve structural integrity even in relatively simple scenarios, as shown by the new Omni-I2C benchmark.
Stop struggling with the stability-plasticity dilemma in multilingual Speech-LLMs: Zipper-LoRA dynamically disentangles LoRA updates to boost low-resource ASR without sacrificing cross-lingual transfer.
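One way to picture the disentangled LoRA updates: a frozen base layer plus a shared branch and per-language branches (names, ranks, and the additive combination are all assumptions):

```python
import torch
import torch.nn as nn

class ZipperLoRALinear(nn.Module):
    """Sketch of the teaser's idea: one LoRA branch shared across
    languages (stability, cross-lingual transfer) plus one branch per
    language (plasticity for low-resource ASR). Hypothetical sketch.
    """
    def __init__(self, base: nn.Linear, rank: int = 8, num_langs: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        d_in, d_out = base.in_features, base.out_features
        self.shared_a = nn.Parameter(torch.randn(d_in, rank) * 0.01)
        self.shared_b = nn.Parameter(torch.zeros(rank, d_out))
        self.lang_a = nn.Parameter(torch.randn(num_langs, d_in, rank) * 0.01)
        self.lang_b = nn.Parameter(torch.zeros(num_langs, rank, d_out))

    def forward(self, x: torch.Tensor, lang: int) -> torch.Tensor:
        shared = x @ self.shared_a @ self.shared_b
        specific = x @ self.lang_a[lang] @ self.lang_b[lang]
        return self.base(x) + shared + specific
```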
MLLMs' image segmentation prowess isn't a given: a critical adapter layer actually *hurts* performance, with the LLM having to recover via attention-mediated refinement.
Forget real-world video datasets: training VLMs on just 7.7K synthetic videos with temporal primitives beats 165K real-world examples, unlocking surprisingly effective transfer learning for video reasoning.
Injecting semantic information from related modalities early in the embedding process significantly boosts performance on multimodal medical image classification tasks.