Adversarial clothing with non-overlapping visible-thermal patterns can reliably evade RGB-T detectors, even transferring across different fusion architectures.
Quadrupedal robots can now perform dynamic loco-manipulation in the real world, matching human teleoperation, using only onboard egocentric vision and a low-frequency (5 Hz) open-vocabulary detector.
Instead of training separate video diffusion models for each multimodal task, UniVidX learns a single model that handles diverse pixel-aligned video generation problems.
Even the most advanced vision-language models struggle to accurately identify anatomical structures in medical images, raising serious concerns about their reliability in clinical settings.
Today's best vision-language models are surprisingly bad at reading scientific figures, failing to match expert-level reasoning on a new benchmark of experimental images.
Forget fully connected relation graphs: CasLayout's sparse relation modeling unlocks enhanced controllability and realism in 3D indoor scene synthesis.
Simple, artist-friendly quad meshes can now be automatically generated on 3D shapes using a diffusion model trained on a continuous surface representation, sidestepping the complexity of discrete mesh optimization.
Multimodal perception is no longer just an add-on: GLM-5V-Turbo bakes it directly into the core of reasoning, planning, and action.
Achieve real-time robotic action with 79-91% success while generating high-fidelity 4D reconstructions, all within a single unified world model.
Ditch the pixel-perfect edits: letting multimodal models fully *reimagine* images based on semantic understanding yields massive quality gains in refinement tasks.
Imagine specifying complex 3D articulations with just a few 2D sketches – Sketch2Arti makes it a reality.
Achieve millimeter-level accuracy in 3D human body fitting from multimodal inputs, even with the scale distortion common in AI-generated assets.
Point-VLMs can learn to see the world as it really is: targeted reward assignment and cross-modal verification nearly close the reality gap in 3D reasoning.
By unifying generative and discriminative approaches, UniGenDet achieves superior image generation and detection, suggesting that these tasks benefit from a symbiotic relationship previously hindered by architectural divergence.
Generative training not only enhances a model's ability to manipulate objects in images, but also surprisingly strengthens its spatial reasoning skills.
LLMs can now predict where drivers look with uncannily human-like accuracy, thanks to a new dataset and architecture that grounds attention in objects, not just scenes.
Training-free diffusion models can now harmonize satellite imagery across diverse domains, enabling scalable remote-sensing synthesis without retraining.
Stop fragmented land cover predictions: SSDM leverages global geospatial embeddings to guide local feature extraction, achieving state-of-the-art performance in high-resolution remote sensing mapping.
Freezing a Stable Diffusion backbone and injecting CLIP and BLIP features lets you beat the state of the art in zero-shot sketch-based 3D shape retrieval, without any costly retraining.
MV-HGNN achieves superior 3D shape retrieval by effectively leveraging geometric dependencies and semantic alignment, outperforming existing methods in zero-shot settings.
VLAs can learn to adapt to new environments at test time without any fine-tuning, achieving significant performance gains on robotic manipulation and Atari games.
RL fine-tuning of discrete diffusion models can be made dramatically more stable and effective by treating the final denoised sample as the action and reconstructing trajectories using the forward diffusion process.
Targeted neuron fine-tuning can unlock superior image translation capabilities in multimodal large language models, outperforming traditional methods by preserving pre-trained knowledge.
Autoregressive 3D layout generation can be both more physically plausible and significantly faster by repurposing existing 3D generative models.
Forget relying on fickle visuals: this new ReID method uses language to describe *who* a person is, not just what they look like, and it substantially outperforms prior methods on existing benchmarks.
Synthesizing realistic anomaly images for industrial assembly is now possible thanks to a diffusion model that respects component pose and assembly relationships.
Achieve photorealistic, identity-consistent facial video edits from text prompts without video training data, rivaling traditional rendering software.
Imagine creating high-fidelity, navigable 3D worlds from just a text prompt or a single image – HY-World 2.0 makes it a reality.
Extracting agricultural parcels from satellite imagery gets a whole lot harder (and more realistic) with a new dataset focused on the complex, irregular, and heterogeneous terrain of terraced farms.
Achieve superior 3D scene reconstruction from aerial images with significantly reduced transmission overhead by directly optimizing communication for rendering quality.
By explicitly modeling both consensus and discrepancy between RGB and IR data, this text-guided multispectral object detector significantly boosts performance on multispectral benchmarks.
Finally, a model that speaks fluent Lottie: LottieGPT generates editable vector animations directly from text or images, opening up a new frontier for resolution-independent, compact, and semantically structured multimedia creation.
Achieve state-of-the-art object detection accuracy and efficiency by fusing RGB frames and event streams with a sparse hypergraph and a fine-grained mixture of experts, enabling real-time edge deployment.
Unlock zero-shot generalization in robot manipulation by generating diverse, affordance-aware training data with 3D generative models and Vision Foundation Models.
Robots can now focus on the *right* body parts for interaction, thanks to a new vision-language model that understands human motion commands and precisely localizes task-relevant 3D keypoints.
Achieve real-time (40 FPS at 720p) interactive video generation with minute-long memory consistency using a 5B parameter world model.
Medical MLLMs, despite their size and training data, stumble on basic image classification due to four key failure modes, revealing a disconnect between hype and clinical readiness.
Turns out, you can cut critical errors in VLM-generated image editing instructions in half with a clever two-stage training pipeline, leading to SOTA editing performance.
LLMs can now leverage visual structure, not just text, to pinpoint bugs in multimodal programs, thanks to a novel graph alignment approach that bridges the gap between GUI screenshots and code.
Synthesizing realistic anomalies for industrial inspection is now possible with just a few examples, thanks to spatially-grounded diffusion that outperforms existing inpainting techniques.
Ditch the slow per-scene optimization: SurfelSplat reconstructs surfaces from sparse views in under a second, matching state-of-the-art accuracy with a 100x speedup.
Synthesizing novel views from extrapolated poses no longer requires dense supervision, thanks to a geometry-conditioned diffusion model that explicitly learns to handle out-of-trajectory artifacts.
Achieve state-of-the-art real-world image dehazing by jointly reconstructing the clear scene and scattering variables, even with non-uniform haze and complex lighting.
Achieve state-of-the-art metal artifact reduction in CT images with MARMamba, a lightweight Mamba-based model that preserves anatomical structure.
Forget global context – ReAlign leverages a stronger VLM to generate *local*, reasoning-guided descriptions that boost visual document retrieval by up to 2%.
Achieve state-of-the-art 3D object detection in adverse weather by adaptively routing between LiDAR, radar, and fused features based on learned weather conditions.
Frontier video models like Veo-3 can generate surprisingly good task-level plans for robot manipulation, but still need help with the fine details.
Finally, underwater SLAM can produce photorealistic maps thanks to a novel medium-aware Gaussian map representation.
GPT-5 can only solve 37% of PhD-level 3D geometry coding problems, suggesting AI can't reliably automate complex scientific coding tasks yet.
Stop training your image restoration models to mimic flawed ground truth; instead, explicitly optimize for perceptual quality using a plug-and-play module guided by No-Reference Image Quality Assessment.