Search papers, labs, and topics across Lattice.
100 papers published across 7 labs.
Image-conditioned video diffusion models can now be fine-tuned to produce more realistic motion dynamics and long-term temporal coherence via a novel reward-driven approach that avoids common pitfalls like reward hacking.
AdaMuS overcomes the bias towards high-dimensional data in multi-view learning by adaptively pruning redundant parameters and sparsely fusing views, leading to improved performance on dimensionally unbalanced data.
By explicitly reasoning in 3D, VolumeDP leaps ahead of 2D-based imitation learning methods, achieving a remarkable 14.8% improvement on the LIBERO benchmark and robust real-world generalization.
By iteratively reasoning over video snippets with a Chain-of-Thought, R²VLM achieves state-of-the-art long-horizon task progress estimation without needing to process entire videos at once.
LLMs can be prompted to generate part-aware instructions that substantially improve open-vocabulary 3D affordance grounding by linking semantically similar affordances and refining geometric differentiation.
Current AI safety filters can't tell a joke from a threat, especially when humor relies on cultural context – this new benchmark exposes that blind spot.
The field of video understanding is rapidly shifting from isolated pipelines to unified models capable of adapting to diverse downstream tasks, demanding a re-evaluation of current approaches.
Unleash creativity in text-to-image models with a single, reusable 64-token template, sidestepping costly iterative prompt engineering and reasoning.
Even with a 98:1 test-to-train ratio, PEFT methods like QLoRA can unlock surprisingly strong generalization from billion-parameter vision models for agricultural image classification, suggesting that underfitting, not overfitting, is the bigger risk.
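A minimal sketch of the kind of setup this result points at, using Hugging Face `transformers` and `peft`; the backbone, label count, and LoRA hyperparameters are illustrative assumptions, not the paper's exact recipe.

```python
import torch
from transformers import AutoModelForImageClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit (NF4) quantization of the frozen backbone, as in QLoRA.
quant = BitsAndBytesConfig(load_in_4bit=True,
                           bnb_4bit_quant_type="nf4",
                           bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForImageClassification.from_pretrained(
    "google/vit-large-patch16-224",   # stand-in for a billion-parameter vision model
    num_labels=10,                    # e.g. crop-disease classes (assumed)
    ignore_mismatched_sizes=True,
    quantization_config=quant,
)

# Low-rank adapters on the attention projections; only these weights train.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["query", "value"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()    # a tiny fraction of total parameters
```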
SAM3 disappoints in eye image segmentation, failing to surpass SAM2's performance despite its new concept prompting mode.
This model beats clinical reports in quantitative coronary angiography, opening the door to automated, comprehensive assessment of coronary artery disease at the point of care.
Forget verbose instructions: this new VLN paradigm uses floor plans to guide navigation with concise commands, boosting success rates by 60%.
Existing 3D visual grounding methods crumble in complex scenes, but PC-CrossDiff's dual-level attention unlocks a +10% accuracy boost by parsing subtle spatial cues.
Naive fine-tuning of VLMs for multimodal sequential recommendation causes catastrophic modality collapse, but can be fixed with gradient rebalancing and cross-modal regularization.
Achieve state-of-the-art fine-grained visual recognition without training by adaptively invoking reasoning in a Large Vision-Language Model only when needed, significantly reducing computational overhead.
Denoised eye-tracking heatmaps dramatically boost the generalization of iris presentation attack detection, outperforming hand annotations and even self-supervised DINOv2 features.
Pruning vision tokens across both the ViT and LLM can yield a 62% efficiency boost in video VLMs with minimal performance loss, and without complex text conditioning.
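For intuition, here is a generic attention-score token-pruning step of the sort this summary describes (not the paper's exact method); the keep ratio of 0.38 simply mirrors the quoted 62% reduction.

```python
import torch

def prune_tokens(vision_tokens: torch.Tensor,
                 cls_attention: torch.Tensor,
                 keep_ratio: float = 0.38) -> torch.Tensor:
    """Keep only the vision tokens the [CLS] token attends to most.

    vision_tokens: (B, N, D) patch embeddings from the ViT.
    cls_attention: (B, N) attention weights from the [CLS] token to each patch.
    """
    _, num_tokens, dim = vision_tokens.shape
    keep = max(1, int(num_tokens * keep_ratio))
    idx = cls_attention.topk(keep, dim=-1).indices   # most-attended patches
    idx = idx.sort(dim=-1).values                    # preserve spatial order
    return torch.gather(vision_tokens, 1,
                        idx.unsqueeze(-1).expand(-1, -1, dim))

# Example: 576 patch tokens reduced before they reach the LLM.
tokens = torch.randn(2, 576, 1024)
scores = torch.rand(2, 576)
print(prune_tokens(tokens, scores).shape)  # torch.Size([2, 218, 1024])
```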
Forget fixed layer counts: LaDe generates fully editable, layered media designs with a *flexible* number of semantically meaningful layers, outperforming existing methods in text-to-layer alignment.
Robots can now nimbly navigate complex, multi-floor environments without prior training, thanks to a new strategy that dynamically switches between exploration, recovery, and memory recall.
Current LMMs can't reliably turn complex images into code, failing to preserve structural integrity even in relatively simple scenarios, as shown by the new Omni-I2C benchmark.
Stop struggling with the stability-plasticity dilemma in multilingual Speech-LLMs: Zipper-LoRA dynamically disentangles LoRA updates to boost low-resource ASR without sacrificing cross-lingual transfer.
MLLMs' image segmentation prowess isn't a given: a critical adapter layer actually *hurts* performance, with the LLM having to recover via attention-mediated refinement.
Forget real-world video datasets: training VLMs on just 7.7K synthetic videos with temporal primitives beats 165K real-world examples, unlocking surprisingly effective transfer learning for video reasoning.
Injecting semantic information from related modalities early in the embedding process significantly boosts performance on multimodal medical image classification tasks.
Achieve state-of-the-art semantic 3D reconstruction from sparse views by intelligently pruning redundant Gaussians and blending 2D and 3D semantic cues.
Ruyi2.5 achieves comparable performance to Qwen3-VL on general multimodal benchmarks while significantly outperforming it in privacy-constrained surveillance, demonstrating the effectiveness of its edge-cloud architecture.
Simply translating symbolic sign language notations into natural language unlocks significantly better motion generation when conditioning on phonological attributes with CLIP.
Finally, a unified framework lets you control both facial appearance and voice timbre for personalized audio-video generation across multiple identities.
Synthesizing realistic 6-DOF object manipulation trajectories in complex 3D environments just got a whole lot better with GMT, a multimodal transformer that substantially outperforms existing methods.
Unlock automated creation of production-ready 3D assets from untextured meshes with TAPESTRY, which generates geometrically consistent turntable videos that can be back-projected into UV textures or used to supervise neural rendering.
Interactive avatars can now exhibit more emotionally appropriate and contextually aware facial behaviors thanks to a novel architecture that disentangles audio-driven lip movements from user-driven non-lip facial expressions.
By disentangling semantic and contextual cues in vision-language models, PCA-Seg achieves state-of-the-art open-vocabulary segmentation with only 0.35M additional parameters per block.
Radiologist dictation, combined with foundation models and minimal parameter updates, can achieve state-of-the-art MRI brain tumor segmentation.
Achieve high-fidelity transparent text animations from image-to-video models without retraining the VAE, sidestepping data scarcity and latent pattern mixing issues.
Forget fine-tuning: this method uses smart patch selection to adapt frozen LVLMs for deepfake detection, outperforming baselines without any training.
Reconstructing realistic 3D human crowds from a single image is now possible, thanks to a new method that cleverly handles occlusions and appearance variations.
Forget expensive 3D training data: Loc3R-VLM shows how to give 2D vision-language models strong 3D spatial reasoning by distilling knowledge from a pretrained 3D foundation model using only monocular video.
By cleverly using readily available video segmentation masks, this method boosts DINOv2's point tracking performance by over 14% – a surprisingly effective way to inject temporal awareness into static image-pretrained models.
VLN agents can navigate more effectively by predicting their future states and proactively planning based on forecasted semantic map cues, rather than relying solely on historical context.
Forget training wheels: GoalVLM lets multi-agent robots navigate to any object you describe, no pre-programmed categories needed.
MLLMs are surprisingly prone to hallucinating subtle details, especially when asked about the absence of specific attributes or relationships within an image.
Overcome scarce data and boost material classification accuracy by generating synthetic training data and distilling knowledge from vision-language foundation models.
Instead of forcing modalities to imitate each other, IIBalance lets each modality contribute according to its intrinsic information budget, leading to better multimodal fusion.
Even when visual data is missing or noisy, EgoAdapt accurately determines who is talking to the camera wearer by adaptively integrating head orientation, lip movement, and robust audio features.
Image editing models leak fascinating hints about their world knowledge through "edit spillover"—unintended changes to semantically related regions—and this paper turns that leakage into a probe.
VLMs don't fail to *recognize* harmful intent when jailbroken; instead, visual inputs *shift* their internal representations into a distinct "jailbreak state," opening a new avenue for defense.
Ditch the diffusion vs. autoregressive debate: this VLA framework uses diffusion to *draft* actions and an autoregressive model to *verify* them, boosting real-world success by nearly 20%.
Quantizing large vision-language models just got a whole lot better: a new token-level sensitivity metric closes the accuracy gap with full-precision models by up to 1.6% in 3-bit weight-only quantization.
CLIP struggles with fine-grained details in cross-domain few-shot learning, but a cycle-consistency method can fix its vision-language alignment and boost performance.
Synthesizing realistic intermediate video frames just got a whole lot better, thanks to a novel attention mechanism that anchors to keyframes and text prompts for improved consistency and semantic alignment.
Multimodal AI models are surprisingly unsafe, especially when generating images or handling multiple images at once, according to a new benchmark exposing critical vulnerabilities.
By probabilistically fusing visual context into text prompts, VirPro closes the semantic gap in weakly-supervised 3D detection, boosting performance by nearly 5% on KITTI.
Unlock the power of MLLMs for structured data like human skeletons with a differentiable rendering approach that allows end-to-end training.
By fusing IMU-derived egomotion with visual data, Motion-MLLM lets MLLMs achieve SOTA 3D scene understanding with 40% less compute.
By unifying layout-to-image generation and image grounding with a novel cycle-consistent learning approach, EchoGen achieves state-of-the-art results in both tasks, proving that solving two problems at once can be better than solving them separately.
Forget finetuning: DynaEdit unlocks complex video edits like action modification and object insertion, all without training, using clever manipulation of pretrained text-to-video models.
Forget waiting minutes for iterative optimization – Omni-3DEdit performs diverse 3D editing tasks in a single forward pass.
Dashcam videos can now be directly linked to legal responsibility determinations via a novel multimodal dataset and legal reasoning framework, outperforming existing LLMs and agent-based systems.
By adaptively calibrating facts and augmenting emotions, FACE-net overcomes the factual-emotional bias that plagues emotional video captioning.
A new prompt-free medical image segmentation model achieves impressive zero-shot and cross-modal transfer performance by explicitly disentangling geometric and semantic anatomical knowledge.
Skip the costly training and go straight to open-vocabulary 3D reasoning with ReLaGS, which builds a 3D semantic scene graph from language-distilled Gaussians.
Overcome weather limitations in remote sensing with MM-OVSeg, a multimodal Optical-SAR fusion framework that enables robust open-vocabulary segmentation even under cloudy conditions.
Grabbing two keyframes per shot – one for the gist, one for the glitch – lets you compress videos for VLMs without missing critical anomalies.
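One plausible reading of the gist/glitch heuristic, sketched over precomputed frame embeddings; the distance-to-centroid rule here is an assumption for illustration, not the paper's stated selection criterion.

```python
import numpy as np

def pick_keyframes(frame_embeddings: np.ndarray,
                   shot_boundaries: list[tuple[int, int]]):
    """For each shot, return two frame indices: the most typical frame
    ("gist") and the most atypical one ("glitch")."""
    keyframes = []
    for start, end in shot_boundaries:
        shot = frame_embeddings[start:end]
        centroid = shot.mean(axis=0)
        dists = np.linalg.norm(shot - centroid, axis=1)
        gist = start + int(dists.argmin())     # closest to the shot's average content
        glitch = start + int(dists.argmax())   # farthest from it, likely the anomaly
        keyframes.append((gist, glitch))
    return keyframes

# Example: 300 frames with 512-d embeddings, split into three shots.
emb = np.random.randn(300, 512)
print(pick_keyframes(emb, [(0, 100), (100, 200), (200, 300)]))
```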
RLHF for autoregressive video generation gets a boost with AR-CoPO, which overcomes the limitations of SDE-based methods by using chunk-level alignment and a semi-on-policy training strategy.
RIS models struggle with motion-based queries, but a new data augmentation and contrastive learning approach closes the gap without sacrificing performance on appearance-based descriptions.
Robot control gets a whole lot faster: ProbeFlow slashes action decoding latency by 14.8x in Vision-Language-Action models, all without retraining.
Surprisingly, you can achieve smooth, controllable image editing in text-to-image models without any training, just by intelligently nudging the text embeddings.
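A minimal sketch of the general idea of nudging text embeddings, assuming a CLIP text encoder and a frozen diffusion pipeline that accepts precomputed prompt embeddings; the linear blending rule is illustrative rather than the paper's method.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Illustrative text encoder; the paper's actual backbone is not specified here.
repo = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(repo)
text_encoder = CLIPTextModel.from_pretrained(repo)

def embed(prompt: str) -> torch.Tensor:
    tokens = tokenizer(prompt, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       return_tensors="pt")
    with torch.no_grad():
        return text_encoder(**tokens).last_hidden_state  # (1, 77, 768)

src = embed("a photo of a cat sitting on a sofa")
tgt = embed("a photo of a dog sitting on a sofa")

# "Nudging": blend the two prompt embeddings and hand the result to a frozen
# text-to-image pipeline (e.g. via a prompt_embeds argument) to sweep smoothly
# from the source image toward the edit, with no training involved.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    blended = (1 - alpha) * src + alpha * tgt
    # pipe(prompt_embeds=blended, ...)  # generation step omitted in this sketch
```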
Gesture-aware pretraining unlocks significant improvements in 3D hand pose estimation, proving that semantic gesture information acts as a powerful inductive bias.
Differential attention and asymmetric loss functions can significantly improve the performance of BiomedCLIP on highly imbalanced video classification tasks like identifying rare pathologies in video capsule endoscopy.
Reconstructing complete, animatable 3D avatars from heavily occluded YouTube videos is now possible, thanks to a hallucination-as-supervision pipeline using diffusion models.
Medical vision-language models perform better when the modality gap is tuned to an intermediate level, challenging the assumption that minimizing it is always optimal.
Turning past programming failures into reusable knowledge boosts automated repair performance by 3.7% on a multimodal benchmark.
By focusing on semantic differences between scans, DiffVP lets LLMs generate more accurate CT reports without needing explicit lesion localization.
An 8B parameter model, RideJudge, outperforms 32B baselines in ride-hailing dispute adjudication by aligning visual semantics with evidentiary protocols, achieving 88.41% accuracy.
Symphony's cognitively-inspired multi-agent system significantly boosts long-form video understanding by mimicking human reasoning, achieving state-of-the-art results on multiple benchmarks.
Forget collapsing videos into text – this hierarchical grid lets you zoom into any moment with lossless visual fidelity, unlocking logarithmic compute scaling for long-form video understanding.
Video fine-tuning boosts MLLMs' video smarts, but surprisingly dumbs them down on static images – a trade-off you can't simply brute-force away with more frames.
Achieve more precise robot control by explicitly disentangling high-level goals from low-level kinematic instructions.
Concept erasure in text-to-image models is mostly smoke and mirrors: a text-free attack can still regenerate "forgotten" concepts by exploiting the model's latent visual knowledge.
VLMs struggle to reason about visual scenes in adverse weather, losing significant segmentation accuracy as rain, snow, or fog intensifies.
Achieve state-of-the-art performance in multimodal remote sensing semantic segmentation with significantly fewer trainable parameters by using a novel parameter-efficient and modality-balanced symmetric fusion framework.
LLMs can navigate complex 3D environments more effectively and with far fewer tokens by using a hierarchical scene graph representation derived from omnidirectional sensor data.
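A toy illustration of why a hierarchical scene graph is token-efficient for this kind of navigation: the node types and the compact serialization below are assumptions for illustration, not the paper's schema.

```python
from dataclasses import dataclass, field

@dataclass
class Object:
    name: str
    position: tuple[float, float, float]

@dataclass
class Room:
    name: str
    objects: list[Object] = field(default_factory=list)

@dataclass
class Floor:
    level: int
    rooms: list[Room] = field(default_factory=list)

def to_prompt(floors: list[Floor]) -> str:
    """Serialize the hierarchy into a short, token-efficient description
    suitable for an LLM navigation prompt."""
    lines = []
    for f in floors:
        lines.append(f"floor {f.level}:")
        for r in f.rooms:
            objs = ", ".join(o.name for o in r.objects) or "empty"
            lines.append(f"  {r.name}: {objs}")
    return "\n".join(lines)

scene = [Floor(0, [Room("kitchen", [Object("mug", (1.2, 0.3, 0.9))]),
                   Room("hallway")])]
print(to_prompt(scene))
```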
Autonomous vehicles can now leverage the rich semantic understanding of VLMs for safer driving without the computational overhead, thanks to a clever training strategy that distills VLM knowledge into a real-time RL policy.
Policies trained on DexViTac's multimodal dataset achieve over 85% success in real-world dexterous manipulation, proving that high-fidelity tactile data unlocks a new level of robotic dexterity.
AdaZoom-GUI achieves SOTA GUI grounding by adaptively zooming in on small elements and refining ambiguous instructions, outperforming even larger models.
VLMs can now drive embodied agents to navigate complex environments with unprecedented efficiency, thanks to a novel framework that bridges the gap between 2D semantic understanding and 3D spatial reasoning.
Don't let your robot's brief moment of panic get lost in the noise – this new uncertainty method spotlights those critical spikes to predict failures before they happen.
Robots can think (and act) twice as fast: HeiSD's hybrid speculative decoding turbocharges embodied agents by intelligently switching between draft and retrieval strategies.
Human-robot teams can get a boost: eye-tracking data alone can predict when a human teammate is struggling to understand the robot's situation with nearly 90% recall.
Current multimodal browsing agents are surprisingly bad at using visual information on webpages, with even top models scoring below 50% accuracy on a new visual-native search benchmark.
Achieve object-level motion control in image-to-video generation without any training by cleverly exploiting attention maps and first-last-frame priors.
Normalizing flows can flag anomalous relationships in scene graphs with 10% better accuracy and 5x faster speed than existing methods, while also exhibiting superior robustness to semantic variations.
Compressing images into 1D token sequences can yield state-of-the-art reconstruction fidelity, challenging the necessity of 2D spatial grids for visual tokenization.
Text-heavy fine-tuning is blinding your MLLM to crucial 3D spatial information, but GAP-MLLM's geometry-aligned pre-training can restore its sight.
Stop blindly steering all layers of your LVLM - this new method uses attribution to apply targeted interventions only where hallucinations originate, preserving performance on general tasks.
Diffusion models can now capture nuanced semantic and material details in image stylization, moving beyond simple color-driven transformations, thanks to a Mixture of Experts architecture.
Fine-tuning Vision-Language Model planners for robotic manipulation is now significantly more efficient and safer thanks to a novel framework that leverages video world models to simulate real-world physics.
Autonomous vehicles can now see through the storm: a new Mixture of Experts approach boosts 3D object detection accuracy by 15% in adverse weather, without slowing things down.
Open-source VLMs can be easily fooled by simple gradient-based attacks, but the degree of vulnerability varies drastically across architectures.
A multimodal pipeline integrating vision, OCR, and LLMs can achieve state-of-the-art content moderation performance at significantly lower latency than existing methods, especially for threats embedded in text.