21 papers from NVIDIA Research on Architecture Design (Transformers, SSMs, MoE)
Stop wasting precious GPU memory: this new cache-semantic hash table library achieves up to 3.9 billion key-value lookups per second, outperforming standard approaches by up to 9.4x.
MLLMs can now handle 4K videos up to 100x faster thanks to AutoGaze, which selectively attends to only the most informative patches.
Training trillion-parameter Mixture-of-Experts models just got a whole lot faster: Megatron Core now sustains >1 PFLOP/s per GPU on NVIDIA's latest hardware.
Text-to-video generation gets a 1.58x speed boost with CalibAtt, a training-free method that exploits consistent sparsity patterns in attention layers.
Forget scaling compute: the future of AI hinges on a 1000x leap in energy efficiency via tight AI+Hardware co-design over the next decade.
Forget text prompts: vector prompt interfaces are the key to unlocking scalable and stable LLM customization.
Achieve state-of-the-art results in high-resolution video geometry estimation by disentangling global coherence and fine detail using a dual-stream transformer architecture.
Representing tensor layouts with a hierarchical algebra unlocks powerful compile-time reasoning and simplifies the expression of tiling/partitioning patterns for specialized hardware.
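To make the layout-algebra idea concrete: a layout can be modeled as a (shape, stride) pair mapping a logical coordinate to a linear memory index, and hierarchy arises by composing an outer tile grid with an inner per-tile layout. The sketch below is a toy illustration in that spirit (similar to CuTe-style layouts); the function names and the 4x4 tiling example are assumptions for illustration, not the paper's notation.

```python
def index(shape, stride, coord):
    """Linear memory index of `coord` under a (shape, stride) layout:
    the dot product of the coordinate with the strides."""
    assert all(0 <= c < s for c, s in zip(coord, shape))
    return sum(c * st for c, st in zip(coord, stride))

def tiled_index(coord, tile=(4, 4), tiles=(2, 2)):
    """Hierarchical layout for an 8x8 matrix stored as a 2x2 grid of
    contiguous 4x4 tiles: compose an outer row-major layout over tiles
    with an inner row-major layout within each tile."""
    i, j = coord
    outer = (i // tile[0], j // tile[1])  # which tile the element is in
    inner = (i % tile[0], j % tile[1])    # position inside that tile
    tile_size = tile[0] * tile[1]
    outer_idx = index(tiles, (tiles[1], 1), outer) * tile_size
    inner_idx = index(tile, (tile[1], 1), inner)
    return outer_idx + inner_idx

# A plain 8x8 row-major layout is the degenerate case:
# index((8, 8), (8, 1), (2, 3)) -> 19
```

Because both levels are themselves (shape, stride) layouts, the composition can be reasoned about symbolically, which is what makes compile-time manipulation of tiling/partitioning patterns tractable.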
By explicitly guiding attention with predicted action sequences, AGA overcomes the limitations of standard dot-product attention in video action anticipation, leading to better generalization and interpretability.
Ditch quadratic scaling in 3D reconstruction: VGG-T$^3$ achieves linear scaling and an 11.6x speed-up by distilling scene geometry into a fixed-size MLP.
By pausing to "think" with latent diffusion, STAR-LDM achieves superior language understanding, narrative coherence, and controllable generation compared to standard autoregressive models of similar size.
Test-time training with KV binding isn't memorization, it's secretly a learned linear attention mechanism, unlocking architectural simplifications and parallelization.
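The equivalence behind this headline can be sketched generically: binding each key to its value via an outer product and accumulating the result is exactly causal linear attention, since querying the running state $S_t = \sum_{s \le t} k_s v_s^\top$ gives $q_t S_t = \sum_{s \le t} (q_t \cdot k_s)\, v_s$. The code below shows that generic mechanism, not the paper's exact formulation.

```python
import numpy as np

def causal_linear_attention(Q, K, V):
    """Causal linear attention via a running key-value "binding" state:
    S_t = sum_{s<=t} outer(k_s, v_s), output_t = q_t @ S_t.
    Equivalent to tril(Q @ K.T) @ V, but computable recurrently."""
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    out = np.zeros_like(V)
    for t in range(T):
        S += np.outer(K[t], V[t])  # bind key t to value t
        out[t] = Q[t] @ S          # query the accumulated state
    return out
```

The recurrent form explains the parallelization angle: the per-step update is a simple associative accumulation, so it can be chunked or scanned rather than run strictly sequentially.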
Time series generation can be dramatically improved by explicitly conditioning on semantic understanding, as demonstrated by a novel vision-centric framework.
Unlock the potential of Kolmogorov-Arnold Networks with WS-KAN, a weight-space architecture that understands their hidden symmetries and predicts their performance far better than generic methods.
Forget monolithic LoRAs: LoRWeB dynamically mixes a basis set of LoRAs to unlock SOTA generalization in visual analogy tasks.
Achieve state-of-the-art depth completion by adapting 3D foundation models at test time with minimal parameter updates, outperforming task-specific encoders that often overfit.
Diffusion models can now solve notoriously ill-posed inverse problems in carbon capture and storage, outperforming standard methods by an order of magnitude and rivaling asymptotically exact methods like rejection sampling while delivering better physical realism.
Forget fixed masking ratios: this new self-supervised learning approach for time-series data dynamically adjusts noise levels to extract richer, more versatile representations.
You can slash LLM inference costs without sacrificing quality by strategically pruning experts, quantizing, and swapping full attention for windowed attention, as demonstrated on gpt-oss-120B.
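Of the three levers mentioned, windowed attention is the easiest to see in isolation: each token attends only to the last `window` positions, cutting per-token cost from O(T) to O(window). The sketch below is an illustrative NumPy reference, not the gpt-oss-120B implementation; `window` and the shapes are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def windowed_attention(Q, K, V, window=4):
    """Causal attention restricted to a sliding window: token t attends
    only to positions [t - window + 1, t], so compute and KV-cache size
    per token are bounded by `window` instead of the sequence length."""
    T, d = Q.shape
    out = np.zeros_like(V)
    for t in range(T):
        lo = max(0, t - window + 1)
        scores = Q[t] @ K[lo:t + 1].T / np.sqrt(d)
        out[t] = softmax(scores) @ V[lo:t + 1]
    return out
```

With `window >= T` this reduces to full causal attention, which is why swapping it in layer-by-layer (as the headline suggests) can preserve quality while shrinking the KV cache.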
By decoupling MLLM instruction tuning from DiT alignment, DuoGen achieves state-of-the-art interleaved multimodal generation without costly unimodal pretraining.
Ditch the clunky pipelines: SongGen generates complete songs from text in a single pass, offering unprecedented control over musical elements and voice cloning.