
Research division of NVIDIA focusing on GPU-accelerated AI, computer graphics, robotics, and autonomous systems.
Claims of quantum advantage in electronic structure calculations must now contend with DMRG benchmarks achieving CAS(89,102) on Fe$_5$S$_{12}$H$_4^{5-}$, pushing the boundaries of classical computation.
Achieve 49% and 19% lower Chamfer distance than state-of-the-art dynamic surface reconstruction methods on the Hi4D and CMU Panoptic datasets, respectively, by enforcing temporal consistency in Gaussian Splatting.
A 30B MoE model can now achieve gold-medal-level performance at the IMO, IOI, and ICPC, rivaling frontier models with 20x more parameters.
Humanoid robots can now traverse complex terrains with human-like gaits, thanks to a surprisingly simple and efficient framework that eschews adversarial training.
World Action Models can ditch the slow, iterative "imagine-then-execute" loop at test time without sacrificing performance, achieving a 4x speedup.
A hybrid cuVSLAM-based visual SLAM system achieves superior mapping accuracy in real-world logistics environments, outperforming other VO/VSLAM approaches.
Stop wasting precious GPU memory: this new cache-semantic hash table library achieves up to 3.9 billion key-value lookups per second, outperforming standard approaches by up to 9.4x.
Forget expensive real-world data collection: a massive, diverse synthetic dataset enables surprisingly effective zero-shot transfer for robotic manipulation.
Stop wrestling with incompatible human body models: SOMA lets you mix and match SMPL, SMPL-X, and more, unlocking the power of diverse datasets in a single, differentiable pipeline.
Current image generation unlearning methods are surprisingly brittle: adversarial image prompts, optimized with attention-guided masking, can effectively resurrect supposedly "forgotten" concepts.
Now you can predict the structure of biomolecular assemblies exceeding 30,000 residues, thanks to a new context parallelism framework that shatters previous memory constraints.
Humanoid robots can now handle heavy, unknown payloads in the real world thanks to a system that identifies mass distribution via differentiable simulation.
Rivaling the English GigaSpeech corpus in scale, TAGARELA unlocks the potential for state-of-the-art Portuguese speech models with its nearly 9,000 hours of podcast audio.
Domain skew in federated learning can be tamed by decoupling and calibrating domain-specific features, leading to more consistent and generalizable global models.
Tactile sensing can be efficiently injected into vision-language-action models via feature-wise linear modulation, boosting robot manipulation performance without the computational overhead of large-scale pretraining.
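For readers unfamiliar with feature-wise linear modulation (FiLM), a minimal PyTorch sketch of the idea follows; the module name, dimensions, and the choice to condition ViT-style tokens are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Modulate visual features with a per-channel scale and shift
    predicted from a tactile embedding (feature-wise linear modulation)."""
    def __init__(self, tactile_dim: int, feature_dim: int):
        super().__init__()
        # A single linear layer emits both gamma (scale) and beta (shift).
        self.to_gamma_beta = nn.Linear(tactile_dim, 2 * feature_dim)

    def forward(self, visual_feats: torch.Tensor, tactile: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(tactile).chunk(2, dim=-1)
        # Broadcast over the token dimension of the visual features.
        gamma, beta = gamma.unsqueeze(1), beta.unsqueeze(1)
        return (1 + gamma) * visual_feats + beta

film = FiLM(tactile_dim=32, feature_dim=256)
tokens = torch.randn(4, 196, 256)   # hypothetical ViT patch tokens
touch = torch.randn(4, 32)          # hypothetical tactile embedding
conditioned = film(tokens, touch)   # same shape as `tokens`
```

Because only a small linear layer is added, this style of conditioning sidesteps the large-scale pretraining cost the summary alludes to.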
Current multimodal models are surprisingly bad at understanding long, complex videos, struggling to integrate audio, visual, and text cues even for basic reasoning tasks.
LLMs can now navigate massive toolsets with a "Try-Check-Retry" loop, boosting tool-calling accuracy by up to 25% and letting smaller models punch above their weight.
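A generic reconstruction of a "Try-Check-Retry" loop, assuming hypothetical `tool` and `validate` callables; the paper's actual interfaces and retry policy may differ.

```python
def try_check_retry(tool, args, validate, max_retries=3):
    """Call a tool, check the result, and retry with the checker's
    error fed back so the model can correct its call."""
    feedback = None
    for _ in range(max_retries):
        result = tool(args, feedback=feedback)   # Try
        ok, error = validate(result)             # Check
        if ok:
            return result
        feedback = error                         # Retry, now informed
    raise RuntimeError(f"Tool call failed after {max_retries} attempts: {feedback}")
```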
Forget slow, model-dependent curation: FAKTUAL offers a fast, model-free way to boost robot imitation learning by directly maximizing the entropy of demonstration datasets.
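One plausible reading of "maximizing the entropy of demonstration datasets" is greedy subset selection over a discrete feature histogram, sketched below; FAKTUAL's actual objective and feature space are not specified here.

```python
import numpy as np

def entropy(counts: np.ndarray) -> float:
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def greedy_max_entropy_subset(demo_histograms, budget: int):
    """Greedily add the demo whose histogram most increases the
    entropy of the pooled dataset (a model-free curation criterion)."""
    pooled = np.zeros_like(demo_histograms[0], dtype=float)
    chosen, remaining = [], list(range(len(demo_histograms)))
    for _ in range(budget):
        best = max(remaining, key=lambda i: entropy(pooled + demo_histograms[i]))
        chosen.append(best)
        pooled += demo_histograms[best]
        remaining.remove(best)
    return chosen
```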
Achieve real-time, synchronized audio-visual generation at 25 FPS by distilling a bidirectional diffusion model into a fast, autoregressive architecture, overcoming training instability with novel alignment and token handling techniques.
Studio-quality speech enhancement without hallucination is now possible, thanks to a clever combination of dry-target finetuning and flow-matching.
Retrieval augmentation lets head avatars handle novel expressions better by mixing in similar expressions from a large unlabeled dataset during training, boosting generalization without extra labels or architecture changes.
Achieve a 40% jump in success rates on real-world contact-rich manipulation by intelligently scheduling force feedback into visual-motor policies.
Forget predefined areas of interest: this multi-agent exploration framework uses Gaussian belief mapping to adaptively balance scientific discovery and safety in hazardous off-world environments.
Human-robot teams can slash interaction costs by 50% and task times by 25% when robots actively resolve uncertainty about tasks and infer human intent using LLMs and spatial reasoning.
LLMs can orchestrate human input to UAVs, dramatically improving mission success rates while minimizing human interaction.
Achieve significantly sharper and more detailed 4D vascular reconstructions from sparse DSA data by injecting super-resolution priors into Gaussian Splatting.
Current multimodal LLMs choke on long-form video understanding, either forgetting details or getting lost in the timeline, but a new agentic architecture with dynamic memory management offers a promising fix.
By combining video generation and vision-language models, EmboAlign achieves a 43% boost in real-world robot manipulation success without any task-specific training.
Forget slow, end-to-end models: building real-time voice agents hinges on a cascaded streaming pipeline, as demonstrated by a new tutorial achieving sub-second latency.
Training generalist robots just got a whole lot easier: RoboCasa365 offers a massive, diverse, and reproducible benchmark for household mobile manipulation.
Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
Forget text prompts: vector prompt interfaces are the key to unlocking scalable and stable LLM customization.
Injecting curvature information into MLIP training via Hessian-vector products achieves the accuracy of full-Hessian training with >24x speedups, opening the door to more efficient and accurate potential energy surface learning.
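The speedup comes from never materializing the Hessian: a Hessian-vector product costs roughly two gradient evaluations via double backprop. A minimal PyTorch sketch, with a toy quadratic loss standing in for an MLIP objective:

```python
import torch

def hvp(loss_fn, params, vec):
    """Hessian-vector product H @ vec via two reverse-mode passes,
    without ever forming the full Hessian."""
    loss = loss_fn(params)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    return torch.cat([h.reshape(-1) for h in
                      torch.autograd.grad(flat_grad @ vec, params)])

# Toy check: for loss = sum(p^2), the Hessian is 2I, so hvp(v) == 2v.
p = torch.randn(3, requires_grad=True)
v = torch.randn(3)
print(hvp(lambda ps: (ps[0] ** 2).sum(), [p], v), 2 * v)
```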
Forget everything you thought you knew about continual learning: pretrained Vision-Language-Action models can learn new robotic skills without catastrophic forgetting, even with minimal replay.
Achieve state-of-the-art results in high-resolution video geometry estimation by disentangling global coherence and fine detail using a dual-stream transformer architecture.
Autonomous exploration by an LLM agent dramatically outperforms both rigid retrieval workflows and supervised fine-tuning for temporal knowledge graph question answering, achieving state-of-the-art results in a zero-shot setting.
Learning robotic reward functions from a million trajectories reveals that comparing entire trajectories, not just individual frames, unlocks better generalization and learning from suboptimal data.
By explicitly guiding attention with predicted action sequences, AGA overcomes the limitations of standard dot-product attention in video action anticipation, leading to better generalization and interpretability.
Multimodal models often exhibit lower confidence than their unimodal counterparts when they're about to fail, and this work leverages that insight to build a better failure detector.
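The simplest instantiation of that insight is a confidence-thresholded failure flag; this max-softmax sketch is a stand-in, not the paper's detector.

```python
import torch

def flag_likely_failures(logits: torch.Tensor, threshold: float = 0.6) -> torch.Tensor:
    """Mark predictions whose maximum softmax probability falls below
    `threshold` as candidate failures for review or abstention."""
    confidence = logits.softmax(dim=-1).max(dim=-1).values
    return confidence < threshold
```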
Representing tensor layouts with a hierarchical algebra unlocks powerful compile-time reasoning and simplifies the expression of tiling/partitioning patterns for specialized hardware.
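The heart of such an algebra is treating a layout as a nested (shape, stride) pair that maps logical coordinates to memory offsets; this toy Python function shows only the mapping, not the compile-time reasoning or the full composition rules.

```python
def offset(coord, shape, stride):
    """Map a (possibly nested) coordinate to a linear memory offset.
    A layout is a pair of congruent nested tuples: shape and stride;
    `shape` bounds the coordinates, `stride` weights them."""
    if isinstance(shape, tuple):
        return sum(offset(c, s, d) for c, s, d in zip(coord, shape, stride))
    return coord * stride

# A 4x8 row-major tile split hierarchically into 2x2 blocks of 2x4:
shape  = ((2, 2), (2, 4))
stride = ((16, 8), (4, 1))
print(offset(((1, 0), (0, 3)), shape, stride))  # 1*16 + 3*1 = 19
```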
Generate minute-long videos with compelling narrative structure and local realism, even with limited long-form training data, by cleverly combining supervised flow matching for global coherence with mode-seeking alignment to a short-video teacher for local fidelity.
Achieve up to 28% better success rates in whole-body mobile manipulation by decoupling base and arm control while intelligently allocating perceptual attention.
Ditch quadratic scaling in 3D reconstruction: VGG-T$^3$ achieves linear scaling and an 11.6x speed-up by distilling scene geometry into a fixed-size MLP.
By explicitly disentangling shared and view-specific features across multi-view fundus images, MVGFDR achieves superior diabetic retinopathy grading compared to methods that directly fuse visual features.
LLMs can now predict client-perceived therapeutic alliance with significantly higher accuracy and provide interpretable rationales, bridging the gap between counselor evaluations and client experiences.
Test-time training with KV binding isn't memorization, it's secretly a learned linear attention mechanism, unlocking architectural simplifications and parallelization.
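The claimed equivalence is easiest to see in the causal linear-attention recurrence, where a running sum of key-value outer products acts as fast weights that each "KV binding" writes into; this NumPy sketch omits feature maps and normalization.

```python
import numpy as np

def linear_attention(Q, K, V):
    """S_t = S_{t-1} + k_t v_t^T   (write a KV binding into fast weights)
       y_t = S_t^T q_t             (read out through those weights)"""
    S = np.zeros((K.shape[1], V.shape[1]))
    out = []
    for q, k, v in zip(Q, K, V):
        S += np.outer(k, v)   # the "test-time training" update
        out.append(S.T @ q)   # the attention read
    return np.stack(out)

T, d = 5, 8
Q, K, V = np.random.randn(3, T, d)
y = linear_attention(Q, K, V)   # shape (T, d)
```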
Uniform-state diffusion models can now achieve state-of-the-art generative performance in language modeling, thanks to a new Predictor-Corrector sampler that breaks the quality plateau of ancestral sampling.
Forget hand-crafted datasets: a new synthetic data pipeline lets smaller LLMs beat giants at terminal tasks.
ShallowConvNet emerges as a surprisingly effective architecture for decoding user intent from EEG signals in real-world robotic control, outperforming more complex models like Transformers.
Unlock robot learning with hidden knowledge: TOPReward extracts surprisingly accurate task progress signals directly from VLM token probabilities, bypassing the need for explicit reward engineering.
Time series generation can be dramatically improved by explicitly conditioning on semantic understanding, as demonstrated by a novel vision-centric framework.
Forget synthetic data—scaling up human egocentric video by 20x unlocks surprisingly effective dexterous robot manipulation, even transferring to robots with different hand configurations.
Unlock the potential of Kolmogorov-Arnold Networks with WS-KAN, a weight-space architecture that understands their hidden symmetries and predicts their performance far better than generic methods.
Forget painstakingly engineering robot behaviors: DreamZero learns directly from video of other robots or even humans, adapting to new tasks and bodies with just minutes of data.
GLM-5 doesn't just code; it engineers, showcasing unprecedented capability in tackling end-to-end software engineering challenges.
Forget complex architectures: RaCo achieves SOTA keypoint matching and repeatability by cleverly combining ranking and covariance estimation in a lightweight network, trained without covisible image pairs.
Forget monolithic LoRAs: LoRWeB dynamically mixes a basis set of LoRAs to unlock SOTA generalization in visual analogy tasks.
Forget robotics pre-training: ActionCodec, a new action tokenizer designed with information-theoretic principles, achieves state-of-the-art VLA performance on LIBERO.
Uniform-state diffusion models, often overlooked in favor of masked diffusion, surprisingly outperform autoregressive and masked diffusion models on GSM8K when scaled to 1.7B parameters, despite worse perplexity.
Achieve state-of-the-art depth completion by adapting 3D foundation models at test time with minimal parameter updates, outperforming task-specific encoders that often overfit.
Pathology image analysis just got a whole lot greener: LitePath slashes computational costs by 400x while matching the accuracy of state-of-the-art models, making AI-powered diagnostics accessible on low-power edge devices.
Forget static datasets – RL-based co-training unlocks +20% real-world VLA performance by interactively leveraging simulation while preserving real-world capabilities.
Smaller reasoning models can achieve both higher accuracy and shorter reasoning chains by adaptively penalizing unnecessary reflections and coordinating length penalties with problem complexity.
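A toy version of a difficulty-coordinated length penalty: accuracy is rewarded, and only tokens beyond a budget that grows with problem difficulty are penalized. The coefficients and the difficulty signal are placeholders, not the paper's.

```python
def shaped_reward(correct: bool, n_tokens: int, difficulty: float,
                  base_budget: int = 256, alpha: float = 1e-3) -> float:
    """Reward accuracy; penalize only the tokens past a difficulty-scaled
    budget, so harder problems are allowed longer reasoning chains."""
    budget = base_budget * (1.0 + difficulty)   # difficulty assumed in [0, 1]
    overflow = max(0.0, n_tokens - budget)
    return (1.0 if correct else 0.0) - alpha * overflow
```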
Training a robot foundation model on 30,000 hours of heterogeneous embodied data lets it outperform prior methods by up to 48% on complex manipulation tasks and even benefit from low-quality data.
LLMs can now understand your spoken questions about complex traffic scenarios and reason about dynamic maps, opening up more intuitive human-machine interaction for autonomous driving.
You can slash LLM inference costs without sacrificing quality by strategically pruning experts, quantizing, and swapping full attention for windowed attention, as demonstrated on gpt-oss-120B.
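Of the three levers, the attention swap is the easiest to show concretely: a causal sliding-window mask restricts each token to its most recent neighbors. A minimal PyTorch sketch (the window size here is arbitrary):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask (True = attendable): causal attention restricted
    to the `window` most recent tokens, giving O(n*w) instead of O(n^2)."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).int())
```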
Forget synthetic benchmarks that don't translate: MolmoSpaces offers 230k diverse, simulator-agnostic environments with 130k annotated objects, showing a remarkable 0.96 sim-to-real correlation for robot policies.
Forget tedious manual segmentation: ArtisanGS lets you lasso objects in 3D Gaussian Splats with AI-powered 2D selections that propagate into 3D, giving you unprecedented control over editing.
Analyzing the structural, numerical, and algebraic redundancy shared across pruning, quantization, and low-rank decomposition enables a criticality-aware compression framework that achieves near-lossless compression to 10% of the original model size.
MLLMs can be made significantly safer in multi-turn dialogues with a new framework that combines cold-start refusal and turn-aware policy optimization, achieving a 10% drop in attack success rate.
Synthesizing realistic radar data from camera images is now possible, bridging the gap between visual and radar perception for autonomous driving.
LLM safety guardrails are far less robust than benchmarks suggest, with accuracy dropping by as much as 57% on novel adversarial attacks, and some even generating harmful content in a "helpful mode" jailbreak.
Current NVS evaluation metrics are misleading, so this paper introduces a task-aware framework using Zero123 features that actually aligns with human perception of quality and faithfulness.
Forget closed-source embedding models: llama-embed-nemotron-8b just topped the MMTEB leaderboard with fully open weights and a data recipe you can actually reproduce.
Forget synthetic data that looks like it came from a PS2 game: NVIDIA's new Cosmos-Predict2.5 generates high-fidelity videos for training embodied AI, opening the door to more realistic and reliable simulations.
This work integrates small-molecule high-throughput screening with deep-learning-based virtual screening to uncover new antibacterial compounds, demonstrating a 90-fold improvement in hit rate over the high-throughput screen used for training.
A multi-agent system that decomposes code generation into planning, coding, and debugging achieves near-perfect pass@1 scores on HumanEval, suggesting a promising path toward reliable automated programming.
Synthetic MRI data, generated by a segmentation-conditioned diffusion model, can measurably improve the performance of 3D U-Nets for hepatic segmentation.
Imagine training robots to manipulate objects in the real world, but entirely within a high-fidelity, diffusion-based dream.
Open-source biomolecular modeling just got a boost: RF3 closes the gap with AlphaFold3 in structure prediction, thanks to the new AtomWorks data framework.
Robot foundation models can achieve state-of-the-art performance by explicitly reasoning about spatial plans as editable trajectory traces, rather than directly mapping perception to control.
Multimodal ophthalmic AI is poised for a leap, but current models still struggle with data variability, limited annotations, and generalization across diverse patient populations.
LLMs are surprisingly bad at keeping up with how people's minds change over time, lagging humans by 45% on a new benchmark designed to test this crucial social skill.
A 3B parameter model, Audio Flamingo 2, now rivals larger proprietary models in audio understanding and reasoning, even handling audio segments up to 5 minutes long.
Ditch the clunky pipelines: SongGen generates complete songs from text in a single pass, offering unprecedented control over musical elements and voice cloning.