Models that process and generate across multiple modalities: vision-language, audio-text, and unified multimodal architectures.
Forget hand-crafted rewards: MotionVL uses VLMs and LLMs to automatically generate task-aligned reward functions for humanoid robot RL, leading to more human-like and robust motion.
Forget training data – Extend3D generates impressive town-scale 3D scenes from a single image by cleverly extending and patching the latent space of an object-centric 3D generative model.
By tightly coupling reasoning, searching, and generation, Unify-Agent achieves state-of-the-art world-grounded image synthesis, rivaling closed-source models and opening new avenues for agent-based multimodal generation.
Forget tedious manual editing: CutClaw's multi-agent system can automatically transform hours of raw footage into engaging, rhythm-aligned short videos.
Cut your 3D-QA model's token budget by 91% and latency by 86% with a new pruning method that intelligently balances semantic importance and geometric coverage.
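For the 3D-QA pruning claim above, the paper's exact scoring isn't reproduced here, but the core idea it names, trading off question relevance against spatial coverage when deciding which scene tokens to keep, can be sketched roughly as follows. All names and parameters (`prune_3d_tokens`, `keep_ratio`, `alpha`) are illustrative assumptions, not the authors' API.

```python
import torch

def prune_3d_tokens(tokens, xyz, text_emb, keep_ratio=0.09, alpha=0.5):
    """Keep a subset of 3D scene tokens by mixing two scores:
    semantic importance (similarity to the question embedding) and
    geometric coverage (distance to already-selected points).
    Hypothetical sketch, not the paper's actual algorithm."""
    n = tokens.size(0)
    k = max(1, int(n * keep_ratio))

    # Semantic importance: cosine similarity between each token and the text query.
    sem = torch.cosine_similarity(tokens, text_emb.expand_as(tokens), dim=-1)

    selected = [int(sem.argmax())]                           # start from the most relevant token
    min_dist = torch.cdist(xyz, xyz[selected]).squeeze(-1)   # distance of every point to the selected set

    for _ in range(k - 1):
        # Coverage term: prefer points far from everything chosen so far.
        cov = min_dist / (min_dist.max() + 1e-6)
        score = alpha * sem + (1 - alpha) * cov
        score[selected] = -float("inf")                      # never re-pick a token
        idx = int(score.argmax())
        selected.append(idx)
        min_dist = torch.minimum(min_dist, torch.cdist(xyz, xyz[[idx]]).squeeze(-1))

    return tokens[selected], xyz[selected]
```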
Multimodal deep learning models for cancer prognosis may not be synergizing information across modalities as much as we think; better performance seems to come from simply adding complementary signals.
Adding MRI data to histopathology and gene expression modestly improves glioma survival prediction, but only when combined effectively in a trimodal deep learning model.
Forget hand-crafted prompts and seed data: Simula lets you generate high-quality synthetic datasets at scale by simply defining the reasoning characteristics you want.
Achieve superhuman robot dexterity with 10x fewer demonstrations by decoupling intent and action through latent world modeling.
Current vision-language models are surprisingly bad at identifying common household safety hazards, but a new benchmark could change that.
Multimodal AI models learn to be lazy, often ignoring entire modalities, and current active learning methods don't fix the problem.
Image generation models can now achieve state-of-the-art fidelity with up to 64x fewer tokens, thanks to a novel masking strategy that prevents latent space collapse.
Run multiple LoRA-tuned GenAI models on your phone without blowing up storage or latency: just swap weights at runtime.
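As a rough illustration of that runtime-swap idea (a minimal sketch under assumed names, not any specific framework's implementation): the frozen base weights are shared across tasks, while each task only contributes a small pair of low-rank matrices that can be switched with a dictionary lookup.

```python
import torch
import torch.nn as nn

class SwappableLoRALinear(nn.Module):
    """A frozen base linear layer plus a registry of small LoRA adapters.
    Only the rank-r matrices (A, B) are stored per task, so switching
    tasks at runtime is a lookup, not a model reload. Illustrative sketch."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # base weights stay shared and frozen
        self.rank = rank
        self.adapters = nn.ModuleDict()        # adapter name -> small (A, B) pair
        self.active = None

    def add_adapter(self, name: str):
        in_f, out_f = self.base.in_features, self.base.out_features
        self.adapters[name] = nn.ModuleDict({
            "A": nn.Linear(in_f, self.rank, bias=False),
            "B": nn.Linear(self.rank, out_f, bias=False),
        })
        nn.init.zeros_(self.adapters[name]["B"].weight)   # adapter starts as a no-op

    def set_adapter(self, name: str):
        self.active = name                     # runtime swap: no weight copy

    def forward(self, x):
        y = self.base(x)
        if self.active is not None:
            ad = self.adapters[self.active]
            y = y + ad["B"](ad["A"](x))        # add the low-rank update B(A(x))
        return y
```

In use, each target linear layer in the on-device model would be wrapped once, adapters registered per task (e.g. `add_adapter("photo_style")`), and `set_adapter` called when the user switches features.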
Multilingual vision-language models can achieve surprisingly strong performance (36% on MMMU) simply by training on translated data and aligning with parallel text corpora.
MLLMs are more vulnerable than we thought: imperceptible visual prompts can effectively hijack their behavior.
Forget fine-tuning: this HTR model adapts to new handwriting styles in just a few shots, *without* any parameter updates.
Adversarial training doesn't have to destroy VLMs' zero-shot abilities: aligning adversarial visual features with textual embeddings using the original model's probabilistic predictions can actually *improve* robustness.
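The general recipe that claim gestures at, regularizing adversarial fine-tuning with the frozen original model's clean predictions, can be sketched as a KL term. The function name, encoders, and temperature below are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def robust_alignment_loss(student, teacher, images, adv_images, text_emb, tau=0.07):
    """One possible distillation-style objective: the fine-tuned (student) encoder's
    image-text similarity distribution on adversarial images is pulled toward the
    frozen original (teacher) encoder's distribution on clean images.
    text_emb: normalized per-class text embeddings of shape (C, d)."""
    with torch.no_grad():
        t_img = F.normalize(teacher(images), dim=-1)
        teacher_probs = F.softmax(t_img @ text_emb.t() / tau, dim=-1)

    s_img = F.normalize(student(adv_images), dim=-1)
    student_logprobs = F.log_softmax(s_img @ text_emb.t() / tau, dim=-1)

    # KL(teacher || student): preserve zero-shot behaviour while training on adversarial inputs.
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")
```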
AI-generated image forgery detection gets a major boost with PromptForge-350k, a dataset so large and well-annotated it pushes IoU scores 5% higher and generalizes to unseen models.
Correcting a vision-language model's "hallucinations" is now as simple as pinpointing and editing the right intermediate representation, sidestepping costly retraining or dual inference.
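Mechanically, representation editing of this kind can be done with a forward hook that nudges one layer's hidden states along a pre-computed correction direction at inference time; the sketch below is a hypothetical illustration (the layer index, direction, and scale `alpha` are assumptions), not the paper's method.

```python
import torch

def make_edit_hook(direction: torch.Tensor, alpha: float = 1.0):
    """Forward hook that shifts a layer's hidden states along a fixed
    correction direction. `direction` and `alpha` are illustrative knobs."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        edited = hidden + alpha * direction.to(hidden.dtype)   # broadcast over batch and sequence
        return (edited, *output[1:]) if isinstance(output, tuple) else edited

    return hook

# Example usage (hypothetical layer path for a loaded VLM):
# handle = model.language_model.layers[20].register_forward_hook(make_edit_hook(direction))
# ...run inference with the edit active...
# handle.remove()
```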
Physical AI systems struggle not with visual recognition, but with understanding space, physics, and action – and PRISM, a new retail video dataset, dramatically closes this gap.
Diffusion-based denoising can significantly improve composed image retrieval by making similarity scores more robust to hard negative samples.
Throw out your full images: for radiology report summarization, focusing on pathology-relevant visual patches dramatically outperforms using the entire image.
All that glitters is not gold in LVLMs: a new information-theoretic analysis reveals that some lean heavily on language priors while others genuinely fuse vision and language.
Radiology report generation models can now verbalize calibrated confidence estimates, enabling targeted radiologist review of potentially hallucinated findings.
Thiomi slashes Swahili ASR error rates by 61% and unlocks nine more African languages for multimodal AI, proving community-driven data collection can leapfrog existing benchmarks.
Stochastic negative sampling in Direct Preference Optimization (DPO) dramatically improves multimodal sequential recommendation, suggesting that carefully curated "wrong" answers are key to preference learning.
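A minimal sketch of those two ingredients, the standard DPO objective plus a stochastic draw of the rejected item, assuming per-item log-probabilities are already computed; the function names and the simple uniform sampling scheme are illustrative, and the paper's actual sampling strategy may differ.

```python
import random
import torch
import torch.nn.functional as F

def dpo_loss(policy_logps_pos, ref_logps_pos, policy_logps_neg, ref_logps_neg, beta=0.1):
    """Standard DPO objective: push the policy's log-prob margin over the
    reference model toward preferring the chosen item over the rejected one."""
    chosen = policy_logps_pos - ref_logps_pos
    rejected = policy_logps_neg - ref_logps_neg
    return -F.logsigmoid(beta * (chosen - rejected)).mean()

def sample_negative(candidate_pool, positive_id):
    """Stochastic negative sampling: instead of a fixed hard negative, draw a
    fresh 'wrong' item from the pool each update (pool assumed to contain
    more than one distinct item)."""
    neg = positive_id
    while neg == positive_id:
        neg = random.choice(candidate_pool)
    return neg
```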
You don't need a massive model to beat Gemini-2.5-Pro in real-world content moderation: Xuanwu VL-2B achieves superior recall on policy-violating text using only 2B parameters.
Multimodal repair isn't always better: for Scratch program repair, selectively escalating to multimodal prompting based on runtime signals yields a superior success-cost-energy tradeoff compared to uniformly applied multimodal approaches.
Surgical VQA gets a major upgrade: SurgTEMP's hierarchical visual memory and competency-based training leapfrog existing models in understanding complex, time-sensitive surgical procedures.
Forget generating uncanny-valley characters: Gloria lets you create consistent, expressive digital characters in videos exceeding 10 minutes, a leap towards believable virtual actors.
Even state-of-the-art VLMs exhibit systematic failures in reasoning about the physical feasibility of actions in 3D environments, despite high semantic confidence.
Achieve fine-grained, six-degrees-of-freedom camera control in dynamic scenes with a generalizable model that outperforms scene-specific and diffusion-based approaches.
Fusing low-level statistical anomalies, high-level semantic coherence, and mid-level texture patterns makes AI-generated image detection far more reliable across diverse generative models.
Achieve massive gains in few-shot hierarchical multi-label classification (+42%) by adaptively balancing semantic priors and visual evidence using level-aware embeddings.
By injecting LLM-derived contextual cues into skeleton representations, SkeletonContext achieves state-of-the-art zero-shot action recognition, even distinguishing visually similar actions without explicit object interactions.
Radio astronomy-aware self-supervised pre-training beats out-of-the-box Vision Transformers for transfer learning on radio astronomy morphology tasks.
Masked motion generators struggle with complex movements because they treat all frames the same – until now.
Edge cameras can achieve a 45% improvement in cross-modal retrieval accuracy by ditching redundant frames and focusing only on what's new.
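One simple way to realize that "only what's new" idea (an assumed selection rule for illustration, not necessarily the paper's) is to drop frames whose embeddings are nearly identical to the last kept frame:

```python
import torch
import torch.nn.functional as F

def select_novel_frames(frame_embeddings: torch.Tensor, threshold: float = 0.9):
    """Keep a frame only if it differs enough from the last kept frame;
    highly similar consecutive frames are treated as redundant.
    frame_embeddings: (num_frames, d) tensor of per-frame features."""
    kept = [0]                                            # always keep the first frame
    last = F.normalize(frame_embeddings[0], dim=-1)
    for i in range(1, frame_embeddings.size(0)):
        cur = F.normalize(frame_embeddings[i], dim=-1)
        if torch.dot(cur, last).item() < threshold:       # low similarity -> novel content
            kept.append(i)
            last = cur
    return kept
```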
VLMs struggle with Earth observation tasks involving complex land use, but a new dataset with nearly 10 million text annotations could change that.
Querying satellite imagery just got easier: EarthEmbeddingExplorer lets you find images using text, visuals, or location, unlocking insights previously trapped in research papers.
Finally, a blind face restoration method that doesn't just hallucinate details, but lets you precisely control facial attributes via text prompts while maintaining high fidelity.
Multimodal models surprisingly falter when applied to presentation attack detection on ID documents, challenging the assumption that combining visual and textual data inherently improves security.
Ditching depth map projections for camera-LiDAR calibration unlocks significant gains in accuracy and robustness, especially when starting from poor initial extrinsic estimates.
Expert ordinal comparisons reveal that fusing vision and language in wound representation learning boosts agreement by 5.6% over unimodal foundation models for a rare genetic skin disorder.
LLMs can generate more accurate motion trajectories by clustering them into geometrically consistent families, even without retraining.
Gaze, often overlooked, reveals deepfake origins with surprising accuracy, enabling a new CLIP-based approach that significantly boosts deepfake attribution and detection.
Stop segmenting remote sensing images in isolation: modeling inter-unit dependencies boosts open-vocabulary segmentation accuracy by up to 6%.
Negation, a known weakness in VLMs like CLIP, can be dramatically improved by strategically fine-tuning only the *front* layers of the text encoder with a modified contrastive loss.
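A minimal sketch of that partial fine-tuning setup, assuming the Hugging Face CLIP implementation where text-encoder blocks live under `model.text_model.encoder.layers`; the modified contrastive loss itself is paper-specific and not shown here.

```python
from transformers import CLIPModel

def unfreeze_front_text_layers(model: CLIPModel, k: int = 4):
    """Freeze the whole model, then unfreeze only the first k transformer
    blocks of the text encoder (the layers closest to the token embeddings)."""
    for p in model.parameters():
        p.requires_grad = False
    for layer in model.text_model.encoder.layers[:k]:
        for p in layer.parameters():
            p.requires_grad = True

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
unfreeze_front_text_layers(model, k=4)
# Training would then proceed with a contrastive loss over caption / negated-caption pairs.
```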
Forget expensive training: FlexMem unlocks SOTA long-video MLLM performance on a single GPU by cleverly mimicking human memory recall.
Forget tedious optimization – LightHarmony3D generates realistic lighting and shadows for inserted 3D objects in a single pass, making scene augmentation feel truly real.