Search papers, labs, and topics across Lattice.
54 papers published across 6 labs.
Cut inference verification costs by 1000x with a sampling-based cryptographic approach that catches adversarial attacks on Llama-2-7B in milliseconds.
Foundation models trained on audio, general time series, and brain signals can be distilled into a single, powerful encoder for scientific time series, unlocking performance gains on par with task-specific training.
TVLA misses subtle side-channel leakage in neural networks, but a new statistical test closes the gap.
Diffusion language models can achieve up to 26x inference speedups with almost no accuracy loss, thanks to a clever entropy-based KV caching strategy that avoids costly full forward passes.
LLMs can maintain generation quality in long-context scenarios while using significantly less context, simply by adaptively allocating context based on uncertainty.
Securing legacy industrial protocols with modern encryption like ChaCha20-Poly1305 is far more practical than previously thought, adding single-digit percentage overhead to latency-sensitive applications.
Accurately simulate LLM inference power consumption at scale, from individual GPUs to entire datacenters, with a framework that learns from real-world traces and generalizes to unseen configurations.
Forget massive SRAMs: this work shows that clever data streaming and compute/transfer overlap can yield 22x speedups for transformer inference, even with standard PCIe interconnects.
Get continuous level-of-detail rendering in 3D Gaussian Splatting without sacrificing top-end quality, with no architectural changes needed.
LLM agents can slash task completion time by almost 50% simply by predicting and pre-executing likely tool calls.
Achieve significant reasoning gains in frozen LLMs (+22.4%) without retraining by adaptively routing reward model guidance at the token level during inference.
Forget fixed decoding strategies: RL can learn a lightweight policy to adapt LLM sampling *at test time*, boosting summarization quality by up to 88% without retraining the LLM.
Confidential databases can be 78x faster by ditching crypto in the query path.
Unlocking new high-probability differentials in SIMON32 cracks open avenues for more efficient cryptanalysis, pushing past current state-of-the-art round limits.
Text-to-image synthesis just got almost 4x faster without sacrificing image quality, thanks to a clever twist on Speculative Jacobi Decoding that keeps the generation process moving even when initial drafts are rejected.
Compact ViTs can now rival or surpass CNN-based architectures like YOLO for edge-based object detection, instance segmentation, and pose estimation, thanks to task-specialized distillation.
Ditch the fine-tuning: SVOO achieves up to 1.93x speedup in video generation with sparse attention by exploiting the intrinsic, layer-specific sparsity patterns of attention, with no training required.
Achieve nearly 3x faster LLM inference by intelligently splitting the workload between edge devices and the cloud, without any training.
Edge devices can now run MoEs in real-time thanks to a dynamic quantization scheme that prioritizes important experts and critical layers.
Discrete diffusion models can now generate more diverse text without sacrificing quality, thanks to a new decoding method that explicitly optimizes for diversity during beam search.
Token compression and multi-agent systems are enabling more efficient and interpretable multimodal reasoning in computational pathology, paving the way for trustworthy AI-assisted diagnosis.
Flow-based VLAs can react to environmental changes ten times faster by adaptively prioritizing near-term actions during sampling, unlocking unprecedented real-time responsiveness.
Training speculative decoding models just got an order of magnitude faster, unlocking real-world deployment with a new open-source framework and a suite of production-ready draft models.
LLM endpoints can appear "healthy" according to traditional metrics while undergoing subtle behavioral shifts detectable by monitoring output distributions, highlighting a critical gap in current reliability practices.
LLM watermarks can now survive fine-tuning, quantization, and distillation thanks to a new method that embeds them in a stable functional subspace.
Dramatically speed up histopathology super-resolution by adaptively routing image tiles through a flow-matching network, achieving near-lossless quality at a fraction of the compute.
Video diffusion models can be aggressively quantized down to 6-bit precision with minimal quality loss by dynamically adapting the bit-width of each layer based on its temporal stability.
Achieve both low-bitrate perceptual video compression and practical scalability with ProGVC, a framework that unifies progressive transmission, efficient entropy coding, and detail synthesis.
Pre-trained models unlock surprisingly aggressive quantization in federated learning, slashing communication costs by 40% without sacrificing accuracy on MNIST and CIFAR-100.
Achieve better compression in low-bit quantization by considering not just numerical sensitivity, but also the structural role of each layer.
LLMs can predict multiple tokens in parallel without any training, simply by cleverly probing their embedding space with dynamically generated mask tokens.
Pruning vision tokens across both the ViT and LLM can yield a 62% efficiency boost in video VLMs with minimal performance loss, and without complex text conditioning.
Forget scaling laws: dropout robustness in transformers is a lottery, with smaller models sometimes showing perfect stability while larger models crumble under stochastic inference.
Forget buying new GPUs: clever context-length routing can boost your LLM inference energy efficiency by 2.5x, dwarfing the 1.7x gain from upgrading to a B200.
Forget training separate models for each compression level; this framework achieves state-of-the-art extreme image compression with flexible bitrate control using a single diffusion-based arbitrary-scale super-resolution model.
Forget painstakingly tuning quantization for each LLM: RAMP learns a quantization policy that generalizes across architectures, often outperforming target-specific training.
Lossless compression can actually *speed up* LLM inference on GPUs, not just shrink model size, thanks to ZipServ's hardware-aware design.
Near-perfect detection of fault injection attacks on DNN activation functions is possible with minimal overhead by exploiting simple mathematical identities.
Ditch backprop's limitations: this synthesizable RTL implementation brings predictive coding networks to life in fully distributed hardware.
KANs get a 50x BitOps reduction without accuracy loss by quantizing their B-splines down to 2-3 bits and using lookup tables.
Running robotic manipulation workloads entirely onboard kills robot batteries, but offloading to the cloud tanks accuracy due to network latency, revealing a critical compute placement trade-off.
LLMs can be drastically compressed without retraining because the relative ordering of weights matters far more than their exact values, opening the door to efficient, training-free compression techniques.
Forget SVD: CARE aligns low-rank attention approximations with input activations, boosting accuracy up to 1.7x and slashing perplexity by 215x when converting models to multi-head latent attention.
Ditch the separate anomaly detection model: your existing ML model already holds the keys to faster, better anomaly detection.
Quantizing large vision-language models just got a whole lot better: a new token-level sensitivity metric closes the accuracy gap with full-precision models by up to 1.6% in 3-bit weight-only quantization.
Achieve state-of-the-art anomaly detection in multi-class and continual learning scenarios with AdapTS, a teacher-student framework that slashes memory overhead by up to 149x compared to existing methods.
Even with environmental noise, a VAE-based anomaly detector can spot adversarial attacks on collaborative DNNs with high accuracy.
Achieve 100x radar data compression with only a 1% performance drop by adaptively pruning DCT coefficients based on detection confidence gradients.
LLMs can maintain performance while skipping global attention for 80% of tokens, slashing compute costs and memory footprint in long-context scenarios.
Robot control gets a whole lot faster: ProbeFlow slashes action decoding latency by 14.8x in Vision-Language-Action models, all without retraining.
Instance-specific timestep schedules can significantly boost diffusion model performance, challenging the reliance on global discretization strategies.
LLM serving systems can boost Time-To-First-Token (TTFT) attainment by up to 2.4x simply by prioritizing network flows based on a novel approximation of Least-Laxity-First scheduling.
Forget slow, single-SSD paging: Swarm unlocks 2.7x higher bandwidth for LLM KV-cache offloading by exploiting stable co-activation patterns to parallelize I/O across multiple SSDs.
Robots can think (and act) twice as fast: HeiSD's hybrid speculative decoding turbocharges embodied agents by intelligently switching between draft and retrieval strategies.