100 papers published across 3 labs.
Forget hand-crafted rewards: MotionVL uses VLMs and LLMs to automatically generate task-aligned reward functions for humanoid robot RL, leading to more human-like and robust motion.
Forget training data – Extend3D generates impressive town-scale 3D scenes from a single image by cleverly extending and patching the latent space of an object-centric 3D generative model.
By tightly coupling reasoning, searching, and generation, Unify-Agent achieves state-of-the-art world-grounded image synthesis, rivaling closed-source models and opening new avenues for agent-based multimodal generation.
Forget tedious manual editing: CutClaw's multi-agent system can automatically transform hours of raw footage into engaging, rhythm-aligned short videos.
Cut your 3D-QA model's token budget by 91% and latency by 86% with a new pruning method that intelligently balances semantic importance and geometric coverage.
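A minimal sketch of what importance-plus-coverage token pruning could look like; the greedy selection rule, the mixing weight `alpha`, and all function names below are illustrative assumptions rather than the paper's actual method:

```python
import numpy as np

def prune_tokens(features, coords, scores, keep_ratio=0.09, alpha=0.5):
    """Keep a fraction of 3D tokens by mixing semantic importance with
    geometric coverage (farthest-point style). Names and the mixing rule
    are illustrative, not the paper's algorithm."""
    n = len(scores)
    k = max(1, int(n * keep_ratio))
    s = (scores - scores.min()) / (scores.ptp() + 1e-8)   # normalize importance to [0, 1]
    kept = [int(np.argmax(s))]                             # start from the most salient token
    for _ in range(k - 1):
        # distance of every token to the nearest already-kept token
        d = np.min(np.linalg.norm(coords[:, None] - coords[kept][None], axis=-1), axis=1)
        d = d / (d.max() + 1e-8)
        combined = alpha * s + (1 - alpha) * d              # balance importance vs. coverage
        combined[kept] = -np.inf                            # never re-select a kept token
        kept.append(int(np.argmax(combined)))
    return features[kept], coords[kept]
```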
Multimodal deep learning models for cancer prognosis may not be synergizing information across modalities as much as we think; better performance seems to come from simply adding complementary signals.
Adding MRI data to histopathology and gene expression modestly improves glioma survival prediction, but only when combined effectively in a trimodal deep learning model.
Forget hand-crafted prompts and seed data: Simula lets you generate high-quality synthetic datasets at scale by simply defining the reasoning characteristics you want.
Achieve superhuman robot dexterity with 10x fewer demonstrations by decoupling intent and action through latent world modeling.
Current vision-language models are surprisingly bad at identifying common household safety hazards, but a new benchmark could change that.
Multimodal AI models learn to be lazy, often ignoring entire modalities, and current active learning methods don't fix the problem.
Image generation models can now achieve state-of-the-art fidelity with up to 64x fewer tokens, thanks to a novel masking strategy that prevents latent space collapse.
Run multiple LoRA-tuned GenAI models on your phone without blowing up storage or latency: just swap weights at runtime.
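A rough sketch of runtime adapter swapping under this idea: fold one low-rank delta out of the shared base weights and fold the next one in, so only the tiny adapters are stored per task. The dictionary layout, scaling, and function names are assumptions for illustration, not the paper's on-device mechanism:

```python
import torch

def apply_lora(base_weight, lora_A, lora_B, scale=1.0):
    """Fold a LoRA delta (B @ A) into a frozen base weight."""
    return base_weight + scale * (lora_B @ lora_A)

def swap_adapter(model_weights, old_adapter, new_adapter, scale=1.0):
    """Switch tasks without reloading the base model: subtract the old
    low-rank delta from each weight, then add the new one."""
    swapped = {}
    for name, W in model_weights.items():
        W = W - scale * (old_adapter[name]["B"] @ old_adapter[name]["A"])
        W = W + scale * (new_adapter[name]["B"] @ new_adapter[name]["A"])
        swapped[name] = W
    return swapped
```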
Multilingual vision-language models can achieve surprisingly strong performance (36% on MMMU) simply by training on translated data and aligning with parallel text corpora.
MLLMs are more vulnerable than we thought: imperceptible visual prompts can effectively hijack their behavior.
Forget fine-tuning: this HTR model adapts to new handwriting styles in just a few shots, *without* any parameter updates.
Adversarial training doesn't have to destroy VLMs' zero-shot abilities: aligning adversarial visual features with textual embeddings using the original model's probabilistic predictions can actually *improve* robustness.
AI-generated image forgery detection gets a major boost with PromptForge-350k, a dataset so large and well-annotated it pushes IoU scores 5% higher and generalizes to unseen models.
Correcting a vision-language model's "hallucinations" is now as simple as pinpointing and editing the right intermediate representation, sidestepping costly retraining or dual inference.
Physical AI systems struggle not with visual recognition, but with understanding space, physics, and action – and PRISM, a new retail video dataset, dramatically closes this gap.
Diffusion-based denoising can significantly improve composed image retrieval by making similarity scores more robust to hard negative samples.
Throw out your full images: focusing on pathology-relevant visual patches in radiology reports dramatically outperforms using the entire image for summarization.
All that glitters isn't gold for LVLMs: a new information-theoretic analysis reveals that some lean heavily on language priors while others genuinely fuse vision and language.
Radiology report generation models can now verbalize calibrated confidence estimates, enabling targeted radiologist review of potentially hallucinated findings.
Thiomi slashes Swahili ASR error rates by 61% and unlocks nine more African languages for multimodal AI, proving community-driven data collection can leapfrog existing benchmarks.
Stochastic negative sampling in Direct Preference Optimization (DPO) dramatically improves multimodal sequential recommendation, suggesting that carefully curated "wrong" answers are key to preference learning.
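A minimal sketch of a DPO step where the rejected item is drawn at random from a candidate pool rather than fixed in advance; the sampling rule, `beta`, and the argument names are illustrative guesses, not the paper's code:

```python
import random
import torch.nn.functional as F

def dpo_loss_with_sampled_negative(logp_chosen, ref_chosen,
                                    logp_negatives, ref_negatives, beta=0.1):
    """Standard DPO objective, except the rejected example is sampled
    stochastically from a pool of candidate negatives."""
    idx = random.randrange(len(logp_negatives))            # stochastic negative choice
    chosen_margin = logp_chosen - ref_chosen               # policy vs. reference, preferred item
    rejected_margin = logp_negatives[idx] - ref_negatives[idx]
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin))
```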
You don't need a massive model to beat Gemini-2.5-Pro in real-world content moderation: Xuanwu VL-2B achieves superior recall on policy-violating text using only 2B parameters.
Multimodal repair isn't always better: selectively escalating to multimodal prompting based on runtime signals yields a superior success-cost-energy tradeoff for Scratch program repair compared to uniformly applied multimodal approaches.
Surgical VQA gets a major upgrade: SurgTEMP's hierarchical visual memory and competency-based training leapfrog existing models in understanding complex, time-sensitive surgical procedures.
Forget generating uncanny-valley characters: Gloria lets you create consistent, expressive digital characters in videos exceeding 10 minutes, a leap towards believable virtual actors.
Even state-of-the-art VLMs exhibit systematic failures in reasoning about the physical feasibility of actions in 3D environments, despite high semantic confidence.
Achieve fine-grained, six-degrees-of-freedom camera control in dynamic scenes with a generalizable model that outperforms scene-specific and diffusion-based approaches.
Fusing low-level statistical anomalies, high-level semantic coherence, and mid-level texture patterns makes AI-generated image detection far more reliable across diverse generative models.
Achieve massive gains in few-shot hierarchical multi-label classification (+42%) by adaptively balancing semantic priors and visual evidence using level-aware embeddings.
By injecting LLM-derived contextual cues into skeleton representations, SkeletonContext achieves state-of-the-art zero-shot action recognition, even distinguishing visually similar actions without explicit object interactions.
Radio astronomy-aware self-supervised pre-training beats out-of-the-box Vision Transformers for transfer learning on radio astronomy morphology tasks.
Masked motion generators struggle with complex movements because they treat all frames the same – until now.
Edge cameras can achieve a 45% improvement in cross-modal retrieval accuracy by ditching redundant frames and focusing only on what's new.
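One way to read "focusing only on what's new" is novelty-gated frame selection before indexing; the per-pixel L1 novelty score and threshold below are illustrative assumptions, not the paper's criterion:

```python
import numpy as np

def select_novel_frames(frames, threshold=0.15):
    """Drop near-duplicate frames before cross-modal retrieval: a frame is
    kept only if it differs enough from the last kept frame."""
    kept = [0]
    for i in range(1, len(frames)):
        novelty = np.mean(np.abs(frames[i].astype(np.float32) -
                                 frames[kept[-1]].astype(np.float32))) / 255.0
        if novelty > threshold:
            kept.append(i)
    return kept
```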
VLMs struggle with Earth observation tasks involving complex land use, but a new dataset with nearly 10 million text annotations could change that.
Querying satellite imagery just got easier: EarthEmbeddingExplorer lets you find images using text, visuals, or location, unlocking insights previously trapped in research papers.
Finally, a blind face restoration method that doesn't just hallucinate details, but lets you precisely control facial attributes via text prompts while maintaining high fidelity.
Multimodal models surprisingly falter when applied to presentation attack detection on ID documents, challenging the assumption that combining visual and textual data inherently improves security.
Ditching depth map projections for camera-LiDAR calibration unlocks significant gains in accuracy and robustness, especially when starting from poor initial extrinsic estimates.
Expert ordinal comparisons reveal that fusing vision and language in wound representation learning boosts agreement by 5.6% over unimodal foundation models for a rare genetic skin disorder.
LLMs can generate more accurate motion trajectories by clustering them into geometrically consistent families, even without retraining.
Gaze, often overlooked, reveals deepfake origins with surprising accuracy, enabling a new CLIP-based approach that significantly boosts deepfake attribution and detection.
Stop segmenting remote sensing images in isolation: modeling inter-unit dependencies boosts open-vocabulary segmentation accuracy by up to 6%.
Negation, a known weakness in VLMs like CLIP, can be dramatically improved by strategically fine-tuning only the *front* layers of the text encoder with a modified contrastive loss.
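A small sketch of restricting updates to the front of the text encoder, assuming an OpenAI-style CLIP module layout (`transformer.resblocks.i`); the attribute paths and the cutoff `k` are assumptions that will differ across codebases, and the modified contrastive loss itself is not shown:

```python
def unfreeze_front_text_layers(model, k=3):
    """Freeze everything, then re-enable gradients only for the first k
    transformer blocks of the text encoder (torch-style nn.Module)."""
    for p in model.parameters():
        p.requires_grad = False
    for name, p in model.named_parameters():
        # e.g. "transformer.resblocks.2.attn.in_proj_weight" in OpenAI-style CLIP
        for i in range(k):
            if f"transformer.resblocks.{i}." in name:
                p.requires_grad = True
```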
Forget expensive training: FlexMem unlocks SOTA long-video MLLM performance on a single GPU by cleverly mimicking human memory recall.
Forget tedious optimization – LightHarmony3D generates realistic lighting and shadows for inserted 3D objects in a single pass, making scene augmentation feel truly real.
Diffusion-based image editing's impressive flexibility comes with fundamental trade-offs between controllability, faithfulness, consistency, locality, and quality, which this paper exposes with clear theoretical bounds.
Current text-to-long-video evaluation metrics can't reliably assess video quality, failing to match human judgment in 9 out of 10 tested degradation aspects.
Current multimodal dialogue models struggle to capture the nuanced expressiveness of human interaction, but a new dataset and benchmark reveal exactly where they fall short.
Humanoids can now nimbly navigate real-world clutter and complex terrain using only raw depth data, ditching hand-engineered geometric representations.
Achieve state-of-the-art robotic manipulation with a model orders of magnitude smaller than VLAs by explicitly aligning kinematic and semantic transitions.
VLN agents can now "dream ahead" by learning action-conditioned visual dynamics in a latent space, leading to SOTA results and improved real-world navigation.
Turn monaural video into immersive binaural audio with SIREN, a visually-guided framework that learns spatial audio cues without task-specific annotations.
MLLMs struggle to plan coherent interleaved text-and-image generation, often missing opportunities for tool use, revealing a critical gap in their ability to unify factuality with creativity.
Giving VLMs access to basic image manipulation tools and a strategic routing system dramatically improves their ability to "see through" visual illusions, even generalizing to unseen illusion types.
Over half of video understanding benchmark samples are solvable without watching the video, and current models barely outperform random guessing on the rest.
Style transfer can now capture the essence of artistic abstraction, not just surface-level appearance, by explicitly reinterpreting image structure.
Finally, a video generation model lets you roam through a scene with long-term spatial and temporal consistency, opening up new possibilities for virtual exploration.
Unleashing creative potential in text-to-image models just got easier: on-the-fly repulsion in the contextual space lets you steer diffusion transformers towards richer diversity without sacrificing image quality or blowing your compute budget.
Generate or edit 1024x1024 images on your phone in under a second with DreamLite, a unified diffusion model that rivals server-side performance despite its tiny 0.39B parameters.
Image generation takes a leap towards real-world knowledge by training an agent that actively searches for and integrates external information, substantially boosting performance on knowledge-intensive tasks.
Zero-shot Vision-Language Models can now guide chip floorplanning, beating specialized ML methods by up to 32% without any fine-tuning.
Current vision-language benchmarks miss the mark: AMIGO reveals how hard it is for agents to ground visual information across multiple images and turns.
Anticancer drugs, whether organic or inorganic, can now be understood through a single unified representation, unlocking knowledge transfer between previously siloed chemical domains.
VLMs can appear to gain up to 58% F1 on clinical tasks simply by *mentioning* MRI data in the prompt, even when the data is uninformative, revealing a "scaffold effect" that inflates performance metrics.
Multi-view learning with prototype-based correction significantly boosts the robustness of thyroid nodule ultrasound classification across different ultrasound devices and clinical environments.
VLA models are brittle: even simple synonym substitutions in instructions cause a 22-52% performance drop in robotic manipulation tasks.
Finally, a framework that unifies dynamic graph models, topological learning, and multimodal fusion to decompose health risk into interpretable components.
Get 80% of your oracle feedback for free: ROVED leverages vision-language embeddings to drastically reduce the need for human preferences in reinforcement learning.
Disentangling perception and reasoning with role-specific rewards in multimodal LLMs boosts accuracy by 7 points, revealing a critical bottleneck in existing joint optimization approaches.
Adversarial training unlocks domain-invariant prompts for CLIP, boosting zero-shot generalization beyond standard prompt tuning.
A 7B model trained on a new dataset of Chinese porcelain outperforms GPT-4 by 12% on expert connoisseurship tasks, demonstrating the power of domain-specific training and tool integration.
MLLMs can now guide visual generative models to imagine what's hidden behind objects, significantly boosting amodal completion performance.
Diffusion models can now predict driver attention with state-of-the-art accuracy by incorporating LLM-enhanced semantic reasoning.
PReD leaps ahead by creating the first foundation model to close the loop on perception, recognition, and decision-making for electromagnetic signals.
Open-source document parsing models are shockingly brittle, losing nearly 18% accuracy on real-world photos and 14% on non-Latin scripts compared to their closed-source counterparts.
VLMs can unlock insights from troves of historical documents previously inaccessible due to OCR limitations, achieving state-of-the-art transcription and speaker tagging of Italian parliamentary speeches.
Scientific figure QA models are often fooled by the answer choices themselves, but a simple decoding strategy that contrasts image-grounded scores with text-only scores can significantly improve accuracy.
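A minimal sketch of that contrastive scoring idea: penalize answer options the model already likes without looking at the figure. The subtraction form and the weight `alpha` are illustrative, not the paper's exact decoding rule:

```python
import numpy as np

def contrastive_choice(logits_with_image, logits_text_only, alpha=1.0):
    """Pick the option whose image-grounded score most exceeds its
    text-only score, discounting language-prior-driven choices."""
    adjusted = np.asarray(logits_with_image) - alpha * np.asarray(logits_text_only)
    return int(np.argmax(adjusted))
```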
Even state-of-the-art vision-language models still struggle to reconcile visual evidence with commonsense, often hallucinating based on prior knowledge instead of what they actually see.
LVLM inference is ripe for optimization, but current acceleration techniques only scratch the surface.
MLLMs are riddled with shared vulnerabilities across modalities, meaning a single weakness can be exploited to jailbreak safety filters, hijack instructions, or even poison training data.
Forget expensive, low-realism 3D renders: diffusion models can now generate photorealistic human datasets that boost model performance beyond real-world data.
Achieve state-of-the-art image similarity generalization with a surprisingly simple, efficient, and interpretable model that operates on local descriptor correspondences.
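In the spirit of that descriptor-correspondence idea, here is a simple interpretable baseline: score similarity by mutual nearest-neighbour matches between local descriptors. It is a sketch of the general technique, not the paper's model:

```python
import numpy as np

def correspondence_similarity(desc_a, desc_b):
    """Score image similarity from local descriptors alone: count mutual
    nearest-neighbour matches and average their cosine similarity."""
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sims = a @ b.T                               # pairwise cosine similarities
    nn_ab = sims.argmax(axis=1)                  # best match in B for each A descriptor
    nn_ba = sims.argmax(axis=0)                  # best match in A for each B descriptor
    mutual = [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]
    if not mutual:
        return 0.0
    return float(np.mean([sims[i, j] for i, j in mutual]))
```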
VLMs can be devastatingly fooled by modifying less than 2% of image pixels in a fixed, X-shaped pattern, causing them to fail spectacularly across diverse tasks like classification, captioning, and question answering.
End-to-end recognition of complex chemical structures from documents is now possible, thanks to a new model and dataset that leapfrog existing methods.
Fuzzy logic bridges the gap between LLM reasoning and low-level artifact detection, creating a surprisingly effective AI-generated image detector.
Achieve robust maritime scene segmentation by jointly learning restoration, fusion, and segmentation, outperforming prior methods in complex, degraded conditions.
Achieve significantly better structure preservation in text-guided image editing by injecting structure-related features into visual autoregressive models, guided by reinforcement learning.
Forget disjointed workflows: AutoCut's unified token space for video, audio, and text slashes ad production costs while boosting consistency.
Finally, a way to measure how efficiently a sketch conveys meaning, moving beyond simple recognition accuracy.
Finally, you can precisely control specific objects in long, consistent driving videos, even those pesky long-tail objects.
Unlock CLIP's black box: EZPC reveals the "why" behind zero-shot image recognition by grounding predictions in human-understandable concepts, without sacrificing accuracy.
ColorFLUX achieves superior old photo colorization by cleverly disentangling structure and color, outperforming even closed-source commercial models.
Bypass the need for predicate annotations in 3D scene graph pretraining: a novel topological layout learning approach drives predicate relation learning without explicit labels.
Forget clunky 3D modeling software – ObjectMorpher lets you intuitively reshape objects in images with simple drags, yielding photorealistic results that outperform both 2D and previous 3D-aware methods.
Training remote sensing image-text retrieval models on real-world noisy data can be significantly improved by a self-paced learning strategy that mimics human cognitive learning patterns.