Models that process and generate across multiple modalities: vision-language, audio-text, and unified multimodal architectures.
OpenSearch-VL offers a fully transparent recipe for training state-of-the-art multimodal search agents, finally democratizing access to a capability previously locked behind closed doors.
Existing image-to-image evaluations miss a critical aspect: whether the output image actually preserves the content of the input.
Driver-behavior decoding accuracy jumps from 73% to 81% by fusing EEG, EMG, and GSR signals, pinpointing the physiological markers that matter most.
Forget expensive on-site inspections: this multimodal model uses assessor text and GIS data to accurately predict building energy performance, enabling scalable retrofit planning.
Decoupling radial and angular dynamics in vision-language model adaptation unlocks significant gains in few-shot performance, outperforming existing flow matching methods.
Training vision-language models to detect glaucoma fairly across demographics requires debiasing both text *and* images, which this paper achieves with a novel pretraining strategy.
Standard multimodal fusion can hurt performance in emotion recognition, but this new approach knows when to drop modalities, leading to state-of-the-art results.
MLLMs can overcome self-referential bias and improve visual grounding by actively exploring and correcting their cognitive deficiencies, guided by token-level epistemic uncertainty.
End-to-end ML models get smoked in real-world mmWave vehicular connectivity: a hybrid vision-primed approach slashes outage rates by leveraging model-based reasoning and RF feedback.
A single vision-language foundation model, DART, can perform a full rope inspection workflow, including damage classification, severity estimation, and few-shot recognition, all without task-specific fine-tuning.
Achieve near-perfect traffic congestion classification by fusing motion-guided visual attention with data-adaptive temporal decomposition, outperforming existing vision-based and signal-based methods.
Identity-preserving video generation just got a whole lot more faithful: FaithfulFaces maintains identity even under extreme pose variations and occlusions, a feat previous methods struggled with.
Achieve 80.5% Top-1 accuracy in zero-shot EEG-to-image retrieval by mimicking the human visual system, substantially outperforming existing methods.
Ditching diffusion's noise-denoising, RLFSeg uses Rectified Flow to directly predict segmentation masks from text prompts, unlocking zero-shot performance gains.
Unlock zero-shot 3D scene understanding: Ilov3Splat lets you identify and segment arbitrary objects in 3D scenes using only natural language, no category supervision needed.
Freezing your VAE and permuting high-frequency visual signals unlocks a new SOTA for VLM prompt learning, boosting harmonic-mean accuracy to 81.51%.
LLMs can now evaluate audio as well as humans do, without task-specific training, thanks to a new instruction-driven framework.
Current image difference captioning benchmarks fail to capture semantic consistency or penalize hallucinations, but DiffCap-Bench offers a robust alternative that aligns with human expert judgments and predicts downstream utility for image editing.
VLMs can be easily tricked into "hallucinating" object relationships with simple image rotations or noise, revealing a surprising fragility in their multimodal reasoning.
Open-source image editing models can match or beat fine-tuned models on visual understanding tasks *without any task-specific training*.
Freezing a text encoder and distilling prompts from vision-language models can stabilize semantics and boost performance in lifelong person re-identification, even across unseen domains.
Counterintuitively, moderately similar reference images are the key to unlocking accurate VLM-based anomaly localization in medical imaging.
By intelligently incorporating LiDAR-derived height information, HiPR overcomes limitations of fixed projection spaces, achieving state-of-the-art camera-LiDAR occupancy prediction with real-time performance.
Finally, a driving dataset that doesn't just assume perfectly paved roads, offering 6.5x more depth data than KITTI for realistic autonomous driving scenarios.
Video-LLMs are leaving performance on the table: explicitly anchoring to keyframes before answering questions unlocks significant gains in Video TextVQA.
Encoding cross-task relationships between building footprints and heights slashes height estimation error by 7% – more effective than just refining individual encoders.
Adult-trained human mesh recovery models can now handle kids, too, thanks to a new framework that enforces spatial consistency and leverages VLM-derived age and gender cues.
Steer LVLMs' attention with caption guidance and watch object hallucinations drop by 6%, with no training required.
Bridging the gap between aerial and ground-level tracking, VL-UniTrack uses visual-language prompts to achieve robust object tracking even with significant viewpoint differences.
Unleashing geospatial reasoning on a torrent of unlabeled remote sensing data, RemoteZero rivals supervised methods by having models verify their own reasoning, not relying on human-annotated coordinates.
Achieve autonomous laparoscope control by translating multimodal surgical data into a single "wrench" that guides the robot's movements.
Audio-native LLMs still lag behind cascaded architectures in key audio tasks, suggesting that the multimodal promise of LLMs isn't quite ready for prime time in the sound domain.
Make your prompts 5x more interpretable without hurting accuracy: IPL combines discrete token selection with continuous optimization, and it's plug-and-play with existing methods.
Robotic manipulation gets a serious upgrade: ConsisVLA-4D boosts performance by up to 41.5% and speeds up inference by 2.4x, all while ensuring your robot understands the scene in 3D *and* how it changes over time.
Achieve spatially grounded natural language descriptions of urban development with PTNet, a new model that understands change semantics better than existing methods.
Forget training from scratch: surprisingly, off-the-shelf 2D diffusion models can unlock generalizable style control in 3D generation models, even for out-of-distribution styles.
Face symmetry and half-face alignment can be combined to achieve state-of-the-art facial expression recognition.
Stop feeding LLMs redundant and conflicting sensor data in autonomous driving: a new architecture slashes hallucinated entities by coordinating multi-sensor inputs *before* reasoning.
Ditch the Bradley-Terry model: a game-theoretic approach to diffusion alignment unlocks better text-to-image generation by directly optimizing for Nash equilibrium in human preferences.
Video-LLMs aren't failing at perception; they're being tricked by their own assumptions, and a new dataset and reasoning chain can fix that.
Unlock efficient 4D object understanding from dynamic point clouds with Velox, a representation that's descriptive, compressive, and accessible.
Forget training, just nudge your text embeddings: RGSE closes the open-vocabulary object detection gap under distribution shift by directly and efficiently adapting text embeddings at test time.
Even with noisy initial matches, Angle-I2P leverages angular consistency and hierarchical attention to achieve state-of-the-art image-to-point cloud registration.
By fusing CLIP with a diffusion model, DiCLIP unlocks surprisingly strong weakly supervised segmentation, outperforming prior methods and slashing training costs.
Stop letting semantics dictate composition: Composer unlocks semantic-agnostic control over image aesthetics, letting you transfer and plan compositions with unprecedented precision.
Adversarial clothing with non-overlapping visible-thermal patterns can reliably evade RGB-T detectors, even transferring across different fusion architectures.
Image-based latent actions are your secret weapon for long-horizon reasoning in VLAs, while action-based latent actions unlock complex motor coordination.
Forget bulky atlases and unreliable image searches: MIRAGE offers medical students a free, interactive tool to retrieve, generate, and understand medical images using only open-source models.
FlowDIS achieves state-of-the-art dichotomous image segmentation by using flow matching, even allowing for precise, pixel-level control via text prompts.
ScriptHOI reveals that current HOI detectors over-rely on object affordance and phrase co-occurrence, and proposes a novel approach to explicitly model interaction scripts for improved open-vocabulary generalization.