75 papers published across 7 labs.
Forget bulky atlases and unreliable image searches: MIRAGE offers medical students a free, interactive tool to retrieve, generate, and understand medical images using only open-source models.
Reconstructing 3D animals in the wild just got a whole lot easier, even in crowded and occluded scenes, thanks to a new promptable framework.
OpenSearch-VL offers a fully transparent recipe for training state-of-the-art multimodal search agents, finally democratizing access to a capability previously locked behind closed doors.
Existing image-to-image evaluations miss a critical aspect: whether the output image actually preserves the content of the input.
Decoding driver behavior jumps from 73% to 81% accuracy by fusing EEG, EMG, and GSR signals, pinpointing the physiological markers that matter most.
Forget expensive on-site inspections: this multimodal model uses assessor text and GIS data to accurately predict building energy performance, enabling scalable retrofit planning.
Decoupling radial and angular dynamics in vision-language model adaptation unlocks significant gains in few-shot performance, outperforming existing flow matching methods.
Training vision-language models to detect glaucoma fairly across demographics requires debiasing both text *and* images, which this paper achieves with a novel pretraining strategy.
Standard multimodal fusion can hurt performance in emotion recognition, but this new approach knows when to drop modalities, leading to state-of-the-art results.
MLLMs can overcome self-referential bias and improve visual grounding by actively exploring and correcting their cognitive deficiencies, guided by token-level epistemic uncertainty.
End-to-end ML models get smoked in real-world mmWave vehicular connectivity: a hybrid vision-primed approach slashes outage rates by leveraging model-based reasoning and RF feedback.
A single vision-language foundation model, DART, can perform a full rope inspection workflow, including damage classification, severity estimation, and few-shot recognition, all without task-specific fine-tuning.
Achieve near-perfect traffic congestion classification by fusing motion-guided visual attention with data-adaptive temporal decomposition, outperforming existing vision-based and signal-based methods.
Identity-preserving video generation just got a whole lot more faithful: FaithfulFaces maintains identity even under extreme pose variations and occlusions, a feat previous methods struggled with.
Achieve 80.5% Top-1 accuracy in zero-shot EEG-to-image retrieval by mimicking the human visual system, substantially outperforming existing methods.
Ditching diffusion's noise-denoising, RLFSeg uses Rectified Flow to directly predict segmentation masks from text prompts, unlocking zero-shot performance gains.
Unlock zero-shot 3D scene understanding: Ilov3Splat lets you identify and segment arbitrary objects in 3D scenes using only natural language, no category supervision needed.
Freezing your VAE and permuting high-frequency visual signals unlocks a new SOTA for VLM prompt learning, boosting harmonic-mean accuracy to 81.51%.
LLMs can now evaluate audio as well as humans, without task-specific training, thanks to a new instruction-driven framework.
Current image difference captioning benchmarks fail to capture semantic consistency or to penalize hallucinations, but DiffCap-Bench offers a robust alternative that aligns with human expert judgments and predicts downstream utility for image editing.
VLMs can be easily tricked into "hallucinating" object relationships with simple image rotations or noise, revealing a surprising fragility in their multimodal reasoning.
Open-source image editing models can match or beat fine-tuned models on visual understanding tasks *without any task-specific training*.
Freezing a text encoder and distilling prompts from vision-language models can stabilize semantics and boost performance in lifelong person re-identification, even across unseen domains.
Counterintuitively, moderately similar reference images are the key to unlocking accurate VLM-based anomaly localization in medical imaging.
By intelligently incorporating LiDAR-derived height information, HiPR overcomes limitations of fixed projection spaces, achieving state-of-the-art camera-LiDAR occupancy prediction with real-time performance.
Finally, a driving dataset that doesn't just assume perfectly paved roads, offering 6.5x more depth data than KITTI for realistic autonomous driving scenarios.
Video-LLMs are leaving performance on the table: explicitly anchoring to keyframes before answering questions unlocks significant gains in Video TextVQA.
Encoding cross-task relationships between building footprints and heights slashes height estimation error by 7% – more effective than just refining individual encoders.
Adult-trained human mesh recovery models can now handle kids, too, thanks to a new framework that enforces spatial consistency and leverages VLM-derived age and gender cues.
Steer LVLMs' attention with caption guidance and watch object hallucinations drop by 6%, no training required.
Bridging the gap between aerial and ground-level tracking, VL-UniTrack uses visual-language prompts to achieve robust object tracking even with significant viewpoint differences.
Unleashing geospatial reasoning on a torrent of unlabeled remote sensing data, RemoteZero rivals supervised methods by having models verify their own reasoning rather than relying on human-annotated coordinates.
Achieve autonomous laparoscope control by translating multimodal surgical data into a single "wrench" that guides the robot's movements.
Audio-native LLMs still lag behind cascaded architectures in key audio tasks, suggesting that the multimodal promise of LLMs isn't quite ready for prime time in the sound domain.
Make your prompts 5x more interpretable without hurting accuracy: IPL combines discrete token selection with continuous optimization, and it's plug-and-play with existing methods.
Robotic manipulation gets a serious upgrade: ConsisVLA-4D boosts performance by up to 41.5% and speeds up inference by 2.4x, all while ensuring your robot understands the scene in 3D *and* how it changes over time.
Achieve spatially grounded natural language descriptions of urban development with PTNet, a new model that understands change semantics better than existing methods.
Forget training from scratch: surprisingly, off-the-shelf 2D diffusion models can unlock generalizable style control in 3D generation models, even for out-of-distribution styles.
Face symmetry and half-face alignment can be combined to achieve state-of-the-art facial expression recognition.
Stop feeding LLMs redundant and conflicting sensor data in autonomous driving: a new architecture slashes hallucinated entities by coordinating multi-sensor inputs *before* reasoning.
Ditch the Bradley-Terry model: a game-theoretic approach to diffusion alignment unlocks better text-to-image generation by directly optimizing for Nash equilibrium in human preferences.
Video-LLMs aren't failing at perception; they're being tricked by their own assumptions, but a new dataset and reasoning chain can fix it.
Unlock efficient 4D object understanding from dynamic point clouds with Velox, a representation that's descriptive, compressive, and accessible.
Forget training, just nudge your text embeddings: RGSE closes the open-vocabulary object detection gap under distribution shift by directly and efficiently adapting text embeddings at test time.
Even with noisy initial matches, Angle-I2P leverages angular consistency and hierarchical attention to achieve state-of-the-art image-to-point cloud registration.
By fusing CLIP with a diffusion model, DiCLIP unlocks surprisingly strong weakly supervised segmentation, outperforming prior methods and slashing training costs.
Stop letting semantics dictate composition: Composer unlocks semantic-agnostic control over image aesthetics, letting you transfer and plan compositions with unprecedented precision.
Adversarial clothing with non-overlapping visible-thermal patterns can reliably evade RGB-T detectors, even transferring across different fusion architectures.
Image-based latent actions are your secret weapon for long-horizon reasoning in VLAs, while action-based latent actions unlock complex motor coordination.
FlowDIS achieves state-of-the-art dichotomous image segmentation by using flow matching, even allowing for precise, pixel-level control via text prompts.
ScriptHOI reveals that current HOI detectors over-rely on object affordance and phrase co-occurrence, and proposes a novel approach to explicitly model interaction scripts for improved open-vocabulary generalization.
Current video generation benchmarks overlook crucial aspects of physical plausibility and temporal coherence, highlighting the need for holistic evaluation metrics like PhyScore.
Get expert-level feedback on your performance, not just a score, thanks to a new approach that uses language generation for proficiency estimation.
RLDX-1 achieves double the success rate of existing VLAs on complex humanoid tasks, suggesting a leap in robots' ability to handle contact-rich, dynamic manipulation.
Bidirectional interaction between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables a unified multimodal model to achieve spatial intelligence beyond general visual competence.
A hierarchical agent that separates visual and textual contexts drastically improves multi-step reasoning on complex charts, outperforming monolithic MLLMs.
Automating materials science database construction is now feasible: a multi-agent system extracts structured data from scientific literature with high speed and accuracy.
Despite impressive OCR performance on existing benchmarks, today's best LMMs still struggle with the messy realities of enterprise document processing.
LLMs struggle with multimodal STEM problems, but a simple dialogue-based intervention can fix 82% of their mistakes without retraining.
Production VLMs like GPT-4, Claude Opus, Gemini, and Grok can be easily manipulated into confidently providing false information via subtle adversarial perturbations to images, even without compromising model alignment.
Event cameras can significantly boost the reliability of autonomous driving in high-dynamic-range and high-speed scenarios, achieving perfect route completion in CARLA benchmarks.
Guaranteeing safe robot navigation in unstructured environments just got easier: translate human language rules into formal logic, ground them with VLMs, and let the robot navigate.
Robot video world models can be significantly improved by distilling a multimodal reward function and stabilizing long-horizon inference, leading to better instruction following and manipulation accuracy.
Robots can now learn manipulation skills from human videos with greater morphological accuracy and temporal consistency, thanks to a new method that disentangles task and embodiment.
Achieve scalable open-vocabulary semantic maps of entire buildings by fusing both dense and instance-level semantic information in a novel dual-layer voxel representation.
Unlock agile humanoid robots by ditching teleoperation and training directly from human VR demos.
Ditch the GPS: this CVGL pipeline achieves a 5.9x improvement in localization accuracy over IMU-only by intelligently fusing satellite imagery with inertial measurements, needing only a single initial GPS fix.
Multimodal graph unlearning doesn't have to destroy utility: carefully protecting high-dimensional input projections during the unlearning process preserves performance while still enabling effective forgetting.
Conformal prediction offers a surprisingly effective way to handle both modality imbalance and noisy corruption in multimodal learning by explicitly modeling predictive uncertainty during training.
Open-sourcing a 0.1B-scale speech-native omni model lets you directly inspect the complete interaction loop and reveals critical design choices for building effective small multimodal models.
Open-sourcing a VLA model that beats closed-source giants on embodied reasoning tasks could finally make real-world robot deployment practical.
LVLMs can achieve SOTA visual reasoning by learning to "see" in a way that optimizes for reasoning, even if it means deviating from strict geometric accuracy.
Achieve state-of-the-art object detection in multi-camera surveillance without compromising data privacy by fusing models trained on synthetically augmented and federated data.
Despite the promise of multimodal context, current audio-language models struggle to leverage clinical information for dysarthric speech recognition, even degrading performance in some cases.
Encoding temporal prediction into video VAEs unlocks faster training, better generative performance, and improved downstream task performance, all at once.