100 papers published across 9 labs.
Scaling visual preference optimization hinges on data quality, as a massive, high-resolution dataset renders complex optimization algorithms unnecessary.
Multimodal models can now achieve state-of-the-art performance in real-world tasks like document understanding and audio-video comprehension with significantly reduced inference latency thanks to novel token-reduction techniques.
Finally, a single model that handles any segmentation task in both images and videos, understanding both text and visual prompts.
Decomposing image editing tasks into meta-tasks and aligning model reasoning with editing behavior unlocks surprising generalization to unseen editing operations.
Forget handcrafted prompts: a hierarchical multi-agent framework turns diffusion models into coherent storytelling engines by globally optimizing for semantic coherence.
Current VLM spatial reasoning benchmarks are misleading, as they often penalize models for "incorrect" answers that are actually correct given the limited visual information the models receive.
Ditching the vision encoder actually *improves* multimodal understanding at scale, proving that pixel embeddings alone can achieve state-of-the-art results in unified multimodal models.
Unlock the secrets hidden in your lab's backed-up microscopy data: style transfer networks can now "re-imagine" images as if they were captured with different instrument settings.
Frozen vision-language models can dramatically improve abnormality grounding in rare disease imaging by iteratively refining decisions through optimized instructions and visual perturbations.
Decomposing robotic manipulation into coarse and fine-grained actions isn't just conceptually cleaner—it actually unlocks a sweet spot where learning difficulty is balanced, boosting performance.
Scaling up LLMs doesn't guarantee expertise: Korean-specific models beat larger global models on a new meteorology benchmark, exposing critical gaps in multimodal reasoning and cultural understanding.
Training on semantically equivalent chart renderings in Python, R, and LaTeX unlocks surprisingly effective multi-language chart-to-code generation from a single model.
Achieve SOTA zero-shot segmentation by simply fusing two CLIP branches, one focusing on local token reliability and the other on structural priors, all without training.
Agentic AI struggles with Earth Observation because reprojection, resampling, and other geospatial operations silently corrupt data, demanding a new agent design paradigm.
Autoregressive image models can now compete with diffusion models in image quality and efficiency, thanks to a variable-length tokenization scheme that decouples compute from resolution.
Text-guided 3D medical image segmentation just got a whole lot more practical: ESICA achieves state-of-the-art accuracy with a "Lite" variant that slashes parameter count without sacrificing performance.
Interactive feedback slashes error rates in episodic memory retrieval, outperforming even large vision-language models while remaining efficient.
Text-to-video models can now learn geometrically consistent world dynamics via reinforcement learning, without expensive architectural changes.
Test-time adaptation of vision-language models can actually *hurt* performance when modalities shift asymmetrically; MG-MTTA fixes this by explicitly modeling modality reliability.
Turns out, your image-generating diffusion model already knows how to segment anything you ask it to.
Robots can now understand human intentions with near-human accuracy thanks to a new video-language model that reasons about goals like a human.
Robots can now leverage human intuition for manipulation tasks, learning from a massive video dataset to improve motion plausibility and robustness, even when conditions change.
Network jitter in cloud-based robot control can be overcome by converting temporal lag into spatial pose offsets, restoring the VLA's original geometric intent without fine-tuning.
Frequency-domain analysis unlocks 1.59x speedups in Vision-Language Navigation by enabling optimal token caching, something visual-domain approaches had been unable to deliver.
Edge NPUs can outperform flagship GPUs in cost and energy efficiency for on-robot VLA model deployment, but only with hardware-aware optimizations that tackle the models' distinct compute-bound and memory-bound phases.
Forget end-to-end fine-tuning: $M^2$-VLA unlocks the power of generalized VLMs for robotic manipulation by intelligently mixing layers and incorporating meta-skills.
Self-supervised vision models that ace linear probing can still flop at semantic image retrieval because of skewed latent space geometry that breaks approximate nearest neighbor search.
Quantum kernels unlock signal in medical image embeddings where classical methods fail, suggesting a new path for extracting value from medical foundation models.
Semantic grounding, not token probability, is the key to better multimodal RAG.
Forget slow, multi-step action generation: CF-VLA's coarse-to-fine approach slashes latency by 75% while boosting real-robot success rates to a new high of 83%.
Species identification and discovery, traditionally treated as separate problems, can be unified into a single framework that leverages retrieval-augmented reasoning for improved accuracy and interpretability.
CLIP models, despite their prowess, stumble when understanding 360° images, failing to maintain semantic alignment under horizontal circular shifts.
Unified multimodal models can ace visual understanding and generation tasks, yet still fail to maintain basic semantic consistency between them.
A new large-scale dataset of human-annotated video crops enables training models that adapt videos to different aspect ratios while preserving visual quality and meaning.
You don't need billions of parameters to accurately ground GUI elements: GoClick, a 230M-parameter model, matches the performance of much larger models, opening the door for on-device GUI agents.
VLMs can be taught to self-correct hallucinations at the token level, leading to substantial gains in reasoning accuracy across diverse benchmarks.
Audio-language models are cheating on benchmarks, acing tests even when they barely listen.
Existing GUI agents can parrot actions, but AutoGUI-v2 reveals they still lack a deep understanding of GUI functionality and struggle to predict the outcomes of even simple interactions.
Achieve surgical 3D edits without training: Prox-E lets you reshape objects with language by manipulating a compact set of geometric primitives.
Disentangling high-level cross-modal reasoning from low-level modality-specific refinement in talking head generation yields superior lip-sync accuracy, video quality, and audio quality compared to entangled approaches.
LLM agents struggle to maintain performance in multi-day collaborative tasks, dropping significantly after just one environmental update, revealing a critical gap in adaptation to evolving real-world conditions.
VLA models introduce a fundamentally new risk landscape compared to LLMs or robotics alone, demanding a unified safety perspective that considers irreversible physical consequences and multimodal attack surfaces.
Unlock the secrets of the deep: OceanPile, a massive, meticulously curated multimodal dataset, finally brings the power of foundation models to the vast and underexplored ocean.
Finding similar analog circuits across netlists, schematics, and descriptions just got way easier: a new model achieves 75% recall, unlocking better circuit design automation.
VLM evaluators, despite their growing use, can miss over 50% of targeted errors in generated images and text, especially when those errors involve fine-grained details or spatial relationships.
Transforming human motion into structured language allows LLMs to achieve unprecedented accuracy in motion understanding without the constraints of traditional encoding methods.
Stop guessing which interactive video model is best: WorldMark offers the first apples-to-apples comparison across leading models on identical scenes and trajectories.
Training a single model across text, images, video, 3D geometry, and hidden representations unlocks "Context Unrolling," where the model reasons across modalities to improve reasoning fidelity.
LVLMs are often tripped up not by faulty vision, but by over-trusting the textual prompt, leading to surprisingly easy-to-fix hallucinations.
Ramen achieves robust test-time adaptation of VLMs in mixed-domain scenarios by selecting the right samples to adapt to, sidestepping the common pitfall of performance degradation when faced with diverse and inconsistent test data.
Stimuli that vision models agree on most strongly drive alignment with language models, doubling cross-modal convergence.
LLMs struggle to answer human-generated questions about multi-chart images, highlighting a critical gap in their ability to reason about real-world data visualizations.
Learnable critics that evaluate the model's own GUI grounding proposals, rather than relying on static geometric heuristics, unlock substantial gains in accuracy.
Ignoring why clinical data is missing can lead to suboptimal treatment policies; this work shows how explicitly modeling informative missingness in multimodal time series data significantly improves both offline treatment policy learning and outcome prediction.
Even GPT-5 achieves only 63% accuracy on time series anomaly questions from real software incidents, but a model-expert combination reaches 87%, highlighting the potential for hybrid intelligence in incident response.
LLMs can extract events more effectively when combined with graph-based document representations that overcome their "lost-in-the-middle" limitations.
Forget rigid workflows: HiCrew's planning layer dynamically orchestrates agents for video understanding, adapting roles and execution paths to the nuances of each question.
LLM-driven visual agents form complex communication structures, but stubbornly resist stylistic convergence, revealing a fundamental tension between social expression and individual identity.
Forget hand-annotated visual reasoning datasets: VG-CoT leverages a fully automated pipeline to generate grounded, step-by-step reasoning, enabling scalable and cost-efficient training of more trustworthy LVLMs.
VLMs' struggles with abstract visual reasoning aren't primarily due to weak reasoning, but rather a representational bottleneck in extracting the right symbolic information from pixels.
MLLMs struggle to "read" missing text directly from visual context, even when they possess the necessary visual grounding and layout understanding.
SOTA audio QA models are getting punked by trivia questions a toddler could answer, revealing a stark gap between current capabilities and true audio understanding.
Pinpointing exactly *when* misinformation occurs in videos is now possible, thanks to two new datasets and a strong baseline for misinformation span detection.
Imagine reconstructing detailed human motion and scene layouts using just your smartwatch and earbuds – no cameras needed.
VLMs can reliably reveal population-level trends in climate change discourse on social media, even when per-image accuracy is only moderate.
MLLMs often *hallucinate* the referent of a pointing gesture, latching onto nearby or salient objects instead of truly understanding spatial semantics.
Current video Q&A benchmarks can be fooled by textual regularities, failing to actually ground reasoning in the video's physical reality.
Multi-modification image retrieval is now possible: TEMA handles complex, real-world instructions that go beyond simple changes, outperforming existing methods on new datasets M-FashionIQ and M-CIRR.
Achieve millimeter-level accuracy in 3D human body fitting from multi-modal inputs, even with scale distortion common in AI-generated assets.
H&E slides can now predict spatial gene expression with significantly improved accuracy and robustness, even when faced with unseen slide variations, thanks to a novel post-hoc calibration technique.
Forget optimal transport – MMD with Neural Tangent Kernels offers a faster, easier-to-optimize path to unsupervised video action segmentation with competitive accuracy.
Scientific reasoning gets a visual upgrade: S1-VL lets models "think with images" by writing and executing Python code to manipulate visuals during multi-step problem solving.
Spatial reasoning gets a boost: a new framework dynamically orchestrates vision-language agents at test time, outperforming fixed-pipeline approaches by adapting to the reliability of different spatial cues.
Ditch the cache: Prototype-Based Test-Time Adaptation (PTA) boosts vision-language model accuracy by nearly 4% while *doubling* inference speed compared to existing cache-based methods.
By adversarially removing camera-specific fingerprints, FryNet forces models to learn genuine chemical representations from thermal images, enabling robust and generalizable frying oil oxidation assessment.
Achieve more precise facial attribute editing by decoupling attribute manipulation from image synthesis, sidestepping the optimization challenges of directly combining GANs and diffusion models.
Forget boring ads: this new method uses creative knowledge to generate videos that actually match product features and move realistically.
Unsupervised video-based person re-identification is now possible without hard pseudo-label assignments, thanks to a hierarchical temporal prototyping approach that significantly outperforms existing methods.
LMMs can gain surprising robustness and visual understanding by learning to denoise corrupted visual tokens, even without extra inference overhead.
Point-VLMs can learn to see the world as it really is: targeted reward assignment and cross-modal verification nearly close the reality gap in 3D reasoning.
Achieve state-of-the-art facial attribute editing and style manipulation with a diffusion model by ditching semantic directions for style codes and a clever forward-backward consistency training strategy that avoids paired images.
Forget brittle visual-history buffers: LoHo-Manip uses a VLM task manager with visual trace prompts to achieve robust long-horizon robotic manipulation through implicit closed-loop replanning.
Real-world robots can now navigate complex environments with human-level instructions, thanks to a new system that combines efficient perception with high-level reasoning, all while running in real time on limited hardware.
Current VLA benchmarks may be overstating real-world readiness, as models succeeding by standard metrics often exhibit unsafe behaviors and poor robustness.
Fine-tuning VLMs with action-aligned language supervision and terrain-aware preference optimization unlocks more robust off-road autonomous driving, outperforming prior approaches on key traversability metrics.
Explicitly constraining action generation with predicted spatial "corridors" boosts VLA model performance by up to 12.4% on challenging robotic manipulation tasks.
Current multimodal LLMs still struggle to integrate information and reason critically when assessed on real scientific papers, despite progress on isolated tasks.
Current technostress research overlooks neurodiversity, but this multimodal design could reveal hidden vulnerabilities and inform more inclusive digital work environments.
By spectrally decoupling robot control into intent and dynamics, ResVLA offers a more efficient and robust approach to generative VLA policies.
Early-fusion UMR models lean too heavily on text, while late-fusion models struggle to relate semantically similar content; MiMIC offers a fix.
Synthetic data can significantly boost controllable human video generation, but only if you carefully select which synthetic samples to use.
Time is a learnable visual concept: models can now reason about and manipulate the flow of time in videos, opening doors to temporally controllable video generation and temporal forensics.
Reshooting video from arbitrary viewpoints just got a whole lot better thanks to a 4D point cloud representation that maintains temporal consistency and precise camera control.
Image editing models can learn to solve visual planning puzzles with fine-tuning, but still lag far behind humans in zero-shot efficiency, revealing a key gap in neural visual reasoning.
Ditch the fixed trade-offs: ParetoSlider lets you smoothly navigate competing generative goals in diffusion models at inference time, without retraining.
Generative training not only enhances a model's ability to manipulate objects in images, but also surprisingly strengthens its spatial reasoning skills.
Vision-based tactile signals in the VTOUCH dataset significantly enhance bimanual manipulation capabilities, paving the way for more effective robotic interactions.
Ditch sparse contact cues: LEXIS-Flow uses a learned manifold of interaction signatures to capture dense, continuous proximity between humans and objects, leading to more realistic 3D HOI reconstructions.
Open-source MLLMs can now achieve state-of-the-art accuracy on complex tabular reasoning tasks, even outperforming models 18x their size, by explicitly penalizing visual hallucinations and shortcut guessing through process-supervised RL.
Current MLLMs fail to detect covert advertisements, revealing a critical gap in social media moderation that could mislead consumers and pose ethical risks.