100 papers published across 7 labs.
Image editing gets a reasoning upgrade: a chain-of-thought verifier model beats powerful VLMs at judging edits and boosts editing model performance.
Forget short-horizon RL: Odysseus proves VLMs can master 100+ turn decision-making in complex games, outperforming state-of-the-art models by 3x.
LVLMs are better at spotting their own mistakes than generating correct answers in the first place, and this self-awareness can be exploited to reduce hallucinations.
Generalist robot policies can achieve 95% success rates on real-world manipulation tasks by continually learning from a fleet of robots, even in the face of distribution shifts and long-tail failures.
Instead of training separate video diffusion models for each multimodal task, UniVidX learns a single model that handles diverse pixel-aligned video generation problems.
Forget grid layouts: Map2World lets you generate consistent 3D worlds from arbitrary segment maps, offering unprecedented control and scalability.
Ditch the complex multimodal pre-training pipelines: GenLIP proves a simple language modeling objective can effectively align vision encoders with LLMs, achieving strong performance with less data.
LVLMs can maintain sharper visual focus during long-form generation by adding a lightweight, learnable memory module that bypasses attention dilution.
LLMs can now generate 3D objects from text that are syntactically correct and geometrically consistent 70% of the time, thanks to retrieval-augmented code synthesis.
Architectural diversity offers surprisingly little defense against adversarial attacks on VLMs for autonomous driving, with physical patches transferring effectively across different models.
Current multimodal LLMs struggle to understand scientific spectra, but a new benchmark and data processing technique could change that.
Enterprise AI doesn't have to be a latency nightmare: this pattern language offers a blueprint for integrating VLAs with deterministic control loops.
Even the most advanced vision-language models struggle to accurately identify anatomical structures in medical images, raising serious concerns about their reliability in clinical settings.
Ignoring language-specific structure in scene-text captioning is a recipe for disaster in tonal languages like Vietnamese, but a new graph framework leveraging phonological attention can help.
Forget turn-based interactions: MiniCPM-o 4.5 lets you build AI that sees, hears, speaks, and *reacts* in real time, all on a device with only 12 GB of RAM.
A 48-camera system finally unlocks real-time, room-scale multi-human, multi-robot interaction research in realistic home environments.
By unifying specialized detectors with MLLMs in an agentic framework, Echo-α achieves state-of-the-art ultrasound interpretation, suggesting a path to more accurate, interpretable, and transferable medical AI.
Ditch the static image: this method generates realistic talking avatars by learning from *videos* of the subject in completely different scenes.
Today's best vision-language models are surprisingly bad at reading scientific figures, failing to match expert-level reasoning on a new benchmark of experimental images.
By explicitly aligning image features with the hierarchical structure of radiology reports, RIHA generates more clinically accurate and coherent reports than models that treat reports as flat sequences.
Forget task-specific architectures: Uni-HOI uses a unified framework with LLMs to jointly model text, human motion, and object motion, enabling strong performance across diverse HOI tasks.
EdgeFM delivers production-grade VLM/LLM inference performance on edge devices, outperforming vendor-specific toolchains by up to 49% while remaining open-source and cross-platform.
Achieve high-fidelity 3D rendering from sparse, unconstrained real-world images by intelligently synthesizing novel views with diffusion models and Gaussian replication.
Ditching PCA for spectral reduction can yield state-of-the-art performance in multisource remote sensing image classification while slashing computational costs.
Visual cues become crucial for speech recognition when audio quality tanks in this challenging new benchmark derived from real-world conversations.
Unlock a baby's-eye view: reconstruct and replay infant movements on robots to simulate their sensory experiences, offering unprecedented insights into early development.
Integrating visual cues into a long-context ASR system slashes word error rate by 16% in multi-talker conversational recordings, proving the power of AV fusion.
Stop drowning your MLLMs in irrelevant document noise: FES-RAG shows that carefully selecting multimodal fragments as evidence boosts performance by up to 27% while shrinking context length.
Teaching VLMs to "look back" and "look ahead" with lightweight spatial reasoning tasks unlocks surprisingly strong navigation performance.
Simple frequency masking and gated injection can dramatically improve the generalization of AI-generated image detectors, even against unseen generative models.
Ditch the costly sampling: Noise2Map turns diffusion models into fast, end-to-end semantic segmentation and change detection machines by directly predicting maps from noise.
VLMs can get a boost in long-tail performance and train more efficiently by dynamically upsampling underrepresented data clusters each epoch.
Even the best vision-language models struggle to reliably set fine-grained GUI states, achieving only 33% accuracy on a new benchmark, but targeted visual hints suggest a clear path to improvement.
Expert-level video aesthetics can be captured and improved using a hierarchical rubric and reward models trained with a progressive learning scheme.
Forget static imitation learning: LaST-R1 unlocks near-perfect robotic manipulation (99.8% success) by adaptively reasoning about physical dynamics *before* acting, then refining with RL.
Today's visual generation models are often evaluated on the wrong things, leading to inflated performance claims that mask critical failures in spatial reasoning, temporal consistency, and causal understanding.
Diffusion models struggle with multi-object generation not because of imbalanced concept representation, but primarily due to scene complexity and a surprising difficulty in counting, especially when training data is limited.
Today's best multimodal agents still fall into "blind execution" traps when building websites from ambiguous, non-expert user instructions, highlighting a critical gap in intent recognition and adaptive interaction.
Stop letting SFT ruin your LMMs: PRISM uses on-policy distillation to realign your model *before* RL, boosting performance by up to 6%.
By jointly embedding spatial biology, histology, and clinical data, Haiku lets you ask "what if" questions about disease progression, revealing molecular shifts linked to clinical outcomes.
Real-time robot control just got a 50x speed boost thanks to MotuBrain's efficient world action model.
Imagine a Pokémon TCG where every card is uniquely yours, dynamically generated by AI to reflect your playstyle and preferences.
VLMs playing the Prisoner's Dilemma can be manipulated into selfish behavior simply by showing them images of aggression or reward matrices with specific color schemes.
By pretraining a VLA model with goal-conditioned RL, PRTS learns to reason about goal reachability, leading to substantial gains in long-horizon robotic tasks and zero-shot generalization.
MLLMs can ace circuit-to-code generation by cheating with identifier semantics, even when the circuit diagram is blank.
Injecting optical flow into VLMs lets them spot subtle video transitions that other methods miss, opening the door to more robust video understanding.
Achieve state-of-the-art multimodal stance detection by having multiple AI agents debate each other, complete with retrieval-augmented context and self-reflection.
A generative model of human physiology not only beats existing clinical risk scores at predicting disease, but also accurately simulates the effects of clinical interventions, paving the way for personalized medicine.
Ditching text chunks for full document page images in medical RAG boosts QA accuracy by a full percentage point, proving that visual context matters.
A single, optimized text snippet can fool CLIP into thinking it's a good caption for almost any image, revealing a surprising vulnerability in cross-modal understanding.
A carefully crafted synthetic data pipeline and rubric-guided RL let a 4B-parameter model nearly match Gemini-3-Flash on wafer defect analysis, suggesting that data quality and targeted training can trump sheer model size.
Persona prompting LLMs for urban sentiment analysis yields surprisingly little behavioral diversity, with a no-persona model often performing just as well.
Controllable 3D generation takes a leap forward with 3D-ReGen, a framework that leverages an initial 3D shape for tasks like enhancement and editing, outperforming existing methods.
Ditch the garment masks: a simple human mask is all you need to nail video virtual try-on in the wild.
Despite the promise of VLMs, current models still struggle to grasp the nuances of climate change discourse in social media videos, highlighting the need for more specialized approaches.
Initializing prompts in flatter regions of the loss landscape dramatically improves calibration and performance in test-time prompt tuning for vision-language models.
By explicitly modeling relationships between multiple relevant video segments, ClipTBP significantly improves video moment retrieval, especially when queries are ambiguous.
LVLMs leak visual text style into semantic inference, meaning the font of a word can change the attributes a model associates with the concept it represents.
Flat 2D images can now be turned into voluminous 3D assets with state-of-the-art fidelity, thanks to a clever inflated-prior and latent-refinement pipeline.
Self-supervised learning from driving videos can beat fully supervised methods for camera pose estimation, even with orders of magnitude less labeled data.
Current MLLMs still struggle to connect the dots between images and text when they're interleaved, highlighting a critical gap in real-world multimodal understanding.
Autonomous driving gets a 30% performance boost in challenging scenarios by having VLAs critique and refine their own driving plans.
Fusing dermoscopic images, clinical photos, and patient metadata with adaptive weighting dramatically improves skin lesion classification, even in imbalanced, real-world clinical datasets.
Quadruped robots can now perform contact-rich manipulation with significantly improved dexterity by learning to "feel" their way through tasks.
Achieve faster VLM inference in bandwidth-constrained edge environments by adaptively compressing visual data, outperforming full-edge and full-cloud solutions without sacrificing semantic accuracy.
Snapchat's new trend detection system proves that LLMs can successfully consolidate multimodal signals at scale to surface emerging topics from short-form video, boosting content freshness and user engagement.
Skewed item distributions in recommendation systems can be tamed with a learnable non-uniform quantization, leading to better codebook utilization and more accurate generative recommendations.
Forget static graphs: TimeMM dynamically reweights user-item interactions based on recency and modality, adapting to evolving user preferences in multimodal recommendations.
Multimodal perception is no longer just an add-on: GLM-5V-Turbo bakes it directly into the core of reasoning, planning, and action.
Texture, not color, is the secret sauce behind fashion house identity, revealed by probing a multimodal CNN trained on decades of Vogue runway images.
Achieve real-time robotic action with 79-91% success while generating high-fidelity 4D reconstructions, all within a single unified world model.
VLN agents can navigate more accurately in zero-shot settings by "looking forward, now, and backward," mimicking human navigational strategies.
Nighttime UAVs can navigate using only thermal cameras and semantic maps, achieving meter-level accuracy without GPS.
Squeezing high-accuracy LLMs and VLMs onto client devices is now significantly more feasible, thanks to a new pipelined sharding technique that achieves up to 30x speedups and 10x VRAM reduction.
Document AI pipelines don't work the way you think: quality bottlenecks aren't where you expect, and components don't cascade quality.
An AI agent autonomously discovered four new superconductors, shrinking the discovery timeline from years to GPU hours.
Despite recent advances, sign language translation models still struggle to leverage the full range of linguistic cues, especially non-manual signals like facial expressions.
LLM agents can now remember far more, far more accurately, by "seeing" their past experiences instead of just reading about them.
VideoLLMs leak training data: a novel black-box attack recovers membership with surprisingly high accuracy (AUC=0.68) by probing generation brittleness across temperatures.
Code stylometry, often overlooked, can significantly boost vulnerability detection, improving F1 scores by up to 48% on key benchmarks.
Time-series classification gets a visual upgrade: fusing raw data with intuitive charts like line, bar, and scatter plots can boost accuracy, especially on smaller datasets.
Predicting comment popularity is more than just content quality – stylistic resonance with the platform's user base is a key ingredient, and this benchmark helps you measure it.
RLVR, the dominant training paradigm for audio language models, may be turning them into unfeeling "answering machines" that excel on benchmarks but fail the vibe check.
Semantic SLAM can now understand free-form language queries and ground them in 3D space using only a monocular video feed, opening the door to robots that truly understand and interact with the world around them.
Ditch the pixel-perfect edits: letting multimodal models fully *reimagine* images based on semantic understanding yields massive quality gains in refinement tasks.
Human motion generation gets a dose of reality: IAM shows that explicitly modeling body morphology and identity leads to more realistic and consistent movements.
Skip the bulky bidirectional teacher: this new method trains a fast, causal audio-video generator directly, slashing sampling steps while maintaining top-tier quality.
Patchwork learning gets a boost: GraphPL uses GNNs to flexibly integrate all observed modalities, achieving SOTA imputation performance even with noisy inputs.
VLMs can ace the ranking but bomb the scoring, revealing a critical flaw in how we evaluate multimodal systems.
Achieve 3x better coverage on out-of-distribution visual question answering by explicitly scoring the quality of visual evidence, even when using black-box models like Gemini-3-Pro.
Software vulnerability detection gets a serious upgrade: aligning code with developer comments boosts F1 scores by up to 27% compared to traditional code-only methods.
Decoupling retrieval and reranking with a discrete diffusion model leaps ahead of monolithic embedding scorers for multi-modal knowledge graph completion.
LVLMs hallucinate less when you intervene *before* they start generating, by cleaning up the initial Key-Value cache with modality-aware steering vectors.
Multimodal language models are fluent liars: they produce convincing procedural video captions that are often factually incomplete, with systematic omissions and role-level inconsistencies exposed by video-grounded verification.
Decoupling the "Thinker" from the "Editor" in image editing allows targeted optimization of reasoning, leading to performance competitive with strong proprietary models using a fixed generative model.
Current VLMs ace diagram question answering, but DRAGON reveals they often fake it, failing to ground their answers in the actual visual evidence.
Forget tedious manual workflows: LLMs can now autonomously generate editable, engine-native 3D cutscenes by intelligently orchestrating animation, cinematography, and sound design.
Diffusion models can now reason recursively over visual tokens, achieving state-of-the-art image generation performance by dynamically selecting specialized neural modules at each diffusion step.
Emotion recognition can be significantly improved by adapting to individual expressive traits, with ML-SAN outperforming static models in capturing nuanced emotional expressions.
Twitter strips C2PA provenance data from AI-generated images, making it impossible to cryptographically verify their origin on the platform.