Search papers, labs, and topics across Lattice.
100 papers published across 9 labs.
Achieve both low-bitrate perceptual video compression and practical scalability with ProGVC, a framework that unifies progressive transmission, efficient entropy coding, and detail synthesis.
Pre-trained models unlock surprisingly aggressive quantization in federated learning, slashing communication costs by 40% without sacrificing accuracy on MNIST and CIFAR-100.
Achieve better compression in low-bit quantization by considering not just numerical sensitivity, but also the structural role of each layer.
LLMs can predict multiple tokens in parallel without any training, simply by cleverly probing their embedding space with dynamically generated mask tokens.
Pruning vision tokens across both the ViT and LLM can yield a 62% efficiency boost in video VLMs with minimal performance loss, and without complex text conditioning.
Forget scaling laws: dropout robustness in transformers is a lottery, with smaller models sometimes showing perfect stability while larger models crumble under stochastic inference.
Forget buying new GPUs – clever context-length routing can boost your LLM inference energy efficiency by 2.5x, dwarfing the 1.7x gain from upgrading to a B200.
Forget training separate models for each compression level; this framework achieves state-of-the-art extreme image compression with flexible bitrate control using a single diffusion-based arbitrary-scale super-resolution model.
Forget painstakingly tuning quantization for each LLM – RAMP learns a quantization policy that generalizes across architectures, often outperforming target-specific training.
Lossless compression can actually *speed up* LLM inference on GPUs, not just shrink model size, thanks to ZipServ's hardware-aware design.
Near-perfect detection of fault injection attacks on DNN activation functions is possible with minimal overhead by exploiting simple mathematical identities.
Ditch backprop's limitations: this synthesizable RTL implementation brings predictive coding networks to life in fully distributed hardware.
KANs get a 50x BitOps reduction without accuracy loss by quantizing their B-splines down to 2-3 bits and using lookup tables.
Running robotic manipulation workloads entirely onboard kills robot batteries, but offloading to the cloud tanks accuracy due to network latency, revealing a critical compute placement trade-off.
LLMs can be drastically compressed without retraining because the relative ordering of weights matters far more than their exact values, opening the door to efficient, training-free compression techniques.
Forget SVD: CARE aligns low-rank attention approximations with input activations, boosting accuracy by up to 1.7x and slashing perplexity by 215x when converting models to multi-head latent attention.
Ditch the separate anomaly detection model: your existing ML model already holds the keys to faster, better anomaly detection.
Quantizing large vision-language models just got a whole lot better: a new token-level sensitivity metric closes the accuracy gap with full-precision models by up to 1.6% in 3-bit weight-only quantization.
Achieve state-of-the-art anomaly detection in multi-class and continual learning scenarios with AdapTS, a teacher-student framework that slashes memory overhead by up to 149x compared to existing methods.
Even with environmental noise, a VAE-based anomaly detector can spot adversarial attacks on collaborative DNNs with high accuracy.
Achieve 100x radar data compression with only a 1% performance drop by adaptively pruning DCT coefficients based on detection confidence gradients.
LLMs can maintain performance while skipping global attention for 80% of tokens, slashing compute costs and memory footprint in long-context scenarios.
Robot control gets a whole lot faster: ProbeFlow slashes action decoding latency by 14.8x in Vision-Language-Action models, all without retraining.
Instance-specific timestep schedules can significantly boost diffusion model performance, challenging the reliance on global discretization strategies.
LLM serving systems can boost Time-To-First-Token (TTFT) attainment by up to 2.4x simply by prioritizing network flows based on a novel approximation of Least-Laxity-First scheduling.
Forget slow, single-SSD paging: Swarm unlocks 2.7x higher bandwidth for LLM KV-cache offloading by exploiting stable co-activation patterns to parallelize I/O across multiple SSDs.
Robots can think (and act) twice as fast: HeiSD's hybrid speculative decoding turbocharges embodied agents by intelligently switching between draft and retrieval strategies.
Forget full finetuning: OPERA's dynamic pruning lets you adapt retrieval models to new domains with better ranking and recall, in half the time.
Biased compression, previously overlooked in distributed learning with gradient coding, can actually boost performance when combined with error feedback to mitigate straggler effects and reduce communication costs.
Achieve personalized generation with cloud-scale reasoning while preserving user privacy, thanks to a novel asymmetric collaboration framework that's also 2x faster.
Forget perplexity – ZipCal uses Zipf's law to curate calibration data for LLM compression, matching state-of-the-art performance at 240x the speed.
Seemingly idle LLM inference fleets can be secretly broken, and this simulator helps you find out why before you buy.
Shrinking a leading 3D hand mesh reconstruction model by 65% yields a 1.5x speedup with minimal accuracy loss, unlocking real-time performance on resource-constrained devices.
Resource-consumption vulnerabilities in LLMs can degrade both service availability and economic sustainability, demanding a systematic approach to understanding and mitigating them.
LLM GPU fleets can be analytically optimized into a two-pool architecture with gateway-layer compression, slashing costs by up to 82% without sacrificing latency.
MXFP4 quantization just got a whole lot better: BATQuant recovers up to 96.43% of full-precision performance in LLMs and MLLMs, even under aggressive W4A4KV16 settings, by preventing outlier propagation across quantization blocks.
Edge offloading with vAccSOL slashes robot-side power consumption by up to 80% and boosts vision pipeline frame rates by up to 24x, extending the operational lifespan of battery-powered robots.
Control video super-resolution with a few keyframes: SparkVSR lets you guide the process and fix artifacts, unlike black-box VSR models.
Object detectors can be made significantly more robust to domain shifts by distilling knowledge from a teacher network trained on clean data to a student trained on downscaled and corrupted versions of the same data.
Forget brute-force inversion: this study reveals a simple rule for choosing the fastest matrix update method in streaming outlier detection, slashing computation time.
Quantizing optimizer states in LLM pre-training introduces "staleness," but strategically timed resets can recover lost performance and reduce memory footprint.
Overcome the quadratic attention bottleneck in vision-language models with Parallel-ICL, a method that achieves comparable performance to full-context learning while drastically reducing inference time.
Frozen LLMs can learn to remember things across conversations, even with limited resources, by training adapters to read and write to a continuous latent space memory bank.
You can now run anomaly detection at 20 FPS with 94% AUROC on a Sony IMX500 sensor, thanks to an 8.7x parameter reduction in a new TinyGLASS architecture.
Stop wasting precious GPU memory: this new cache-semantic hash table library achieves up to 3.9 billion key-value lookups per second, outperforming standard approaches by up to 9.4x.
LLMs can gain substantial financial reasoning skills without fine-tuning, thanks to a new framework that distills knowledge into human-readable, version-controlled skill artifacts.
Shrinking LLM reasoning for mobile devices is now possible: LoRA adapters, RL-based budget forcing, and KV-cache tricks let Qwen2.5-7B reason efficiently on-device.
A simple orthogonal rotation of the activation space makes LLMs virtually immune to bit-flip attacks, even against targeted single-point faults.
Forget blindly pruning LLMs: this work shows you can use Sparse Autoencoders to identify and protect the most functionally important components during compression, leading to more robust models.
Blindly applying GPU optimizations to homomorphic encryption can leave nearly 2x performance on the table, as the best strategy hinges on CKKS parameters and GPU architecture.
Elastic-Sketch's performance hinges on stream characteristics and eviction thresholds, but this work cracks the code to near-optimal configuration by deriving closed-form expressions for its limiting behavior under stationary random streams.
Squeeze your LLM's KV cache by 82% without significant performance loss using VQKV's novel vector quantization approach.
Achieve diffusion-level perceptual quality in monocular depth estimation at 40x the speed, by replacing the slow initial diffusion steps with a fast ViT-based depth map and refining in a compact latent space.
SNNs can be pruned to extreme sparsity without sacrificing accuracy by explicitly controlling temporal distortion across layers and timesteps.
Binary neural networks can now be trained effectively in federated settings, offering a path to low-cost, privacy-preserving edge inference without sacrificing accuracy.
Achieve efficient and positionally consistent simultaneous machine translation with LLMs, regardless of the positional encoding method, using a surprisingly simple explicit position allocation strategy.
Inference time can reveal the GPU models behind black-box LLM APIs, offering a way to estimate their hidden energy costs.
TinyML for agriculture is trending towards localized inference on microcontrollers, but inconsistent resource reporting is slowing down real-world deployment.
Sparsity, often viewed as a means for efficiency, actually unlocks deeper, more effective LLMs by taming variance and boosting layer utilization.
Achieve real-time object detection on resource-constrained AR/VR devices by ditching compute-heavy operations for memory lookups inspired by human vision.
Event cameras get a major efficiency boost: EECVS achieves 2.7x higher throughput and superior generalization in downstream tasks by adaptively compressing event streams using tailored transforms.
Quantizing neural networks doesn't have to mean sacrificing robustness: a new three-stage framework achieves up to 10.35% better attack resilience and 12.47% better fault resilience.
Forget complex combinators: a simple multiplication trick can slash LLM latency by 92% and boost throughput by 21%, outperforming production schedulers.
Machine translation can now safeguard sensitive information during inference thanks to a new task, benchmark datasets, and metrics designed to protect named entities.
Mamba-3 delivers a 1.8 point accuracy boost over competing models in downstream language tasks, proving that SSM-inspired techniques can unlock substantial performance gains without sacrificing inference efficiency.
Squeeze 2x more speed from your conditional flow matching models by optimizing data-noise coupling across minibatches.
Achieve real-time full-body human mesh recovery from a single RGB stream with Fast SAM 3D Body, a 10x speedup over the original without sacrificing accuracy.
For spacecraft-bound neural networks, a new bit-serial matrix multiplication accelerator, bitSMM, delivers impressive GOPS/W on both FPGA and ASIC, promising efficient on-board inference.
Achieve near-ideal GPU sharing without kernel hacks: DetShare guarantees semantic and performance determinism through GPU coroutines and lightweight context migration.
xLSTM models can now effectively learn from large attention-based models, even outperforming their teachers on some tasks through a novel distillation and merging pipeline.
Textual pathways in LVLMs are more sensitive to pruning than visual pathways, implying that you can aggressively prune visual inputs without significantly impacting performance.
Cuckoo filters on GPUs can now achieve performance rivaling append-only Bloom filters, thanks to a novel lock-free architecture and memory access optimization strategy that closes the gap between static and dynamic approximate membership query structures.
Multi-agent LLM systems can slash synchronization costs by up to 95% by borrowing cache coherence strategies from chip design.
LLMs can run up to 35% faster on chiplet architectures thanks to a new lossless exponent compression technique that slashes inter-chiplet communication overhead.
Exact sampling in large-vocabulary decoding can be sped up by 19% simply by fusing it into the LM-head matmul, turning a bandwidth bottleneck into a lightweight epilogue.
TabKD achieves state-of-the-art data-free knowledge distillation for tabular data by generating synthetic data that maximizes interaction diversity, a critical factor previously overlooked.
Turns out, blindly widening the beam search in your LLM can actually *hurt* performance due to overestimation bias, and the optimal width depends critically on your scorer's signal-to-noise ratio.
Forget exotic attention mechanisms – MobileLLM-Flash achieves up to 1.8x faster LLM prefill on mobile CPUs by smartly pruning and adapting existing architectures for on-device use.
Get quantitative safety guarantees with adjustable confidence levels for compressed neural networks, even after aggressive quantization and pruning.
Achieve >19x compression on high-resolution drone imagery without sacrificing object detection performance by intelligently allocating bitrates with a PPO-trained agent guiding a conditional diffusion model.
By selectively attending to question-relevant information across video frames and memory, QViC-MF achieves state-of-the-art results in long-term video understanding, highlighting the importance of feedback-driven perception.
Forget painstakingly tuning RL in the real world – SimDist lets you pre-train a world model in simulation and then rapidly adapt it via supervised learning, slashing data requirements and boosting performance.
Squeezing federated learning through bandwidth-constrained networks? This routing and pruning method boosts accuracy by 12% while slashing latency by 28%.
LLMs can solve math problems more efficiently by "thinking" silently in their latent space, adaptively refining their reasoning process only as much as needed, and slashing token usage by over 90%.
SALT offers a surprisingly effective way to personalize and harden split computing models in closed environments, using a lightweight adapter that outperforms full fine-tuning while slashing training costs.
Forget training from scratch: PrototypeNAS finds deployable MCU-optimized DNNs in minutes using zero-shot proxies and smart search space design.
MDLMs can be significantly improved *without* retraining by using attention weights to guide sampling based on inter-token dependencies.
Document parsing just got a whole lot faster: a simple plug-in method boosts VLM decoding speed by up to 2.2x while also reducing hallucinations.
IRIS achieves real-time rendering and editing of neural scenes by analytically computing ray intersections and aggregating features along the ray, sidestepping slow volumetric sampling and spatial lookups.
Achieve up to 12.63% performance gains on fine-grained visual categorization by adaptively distilling knowledge from VLMs to lightweight classifiers using a task-aligned intermediate teacher.
Text-based speculative decoding falls flat for vision-language models, but ViSkip dynamically adapts to vision tokens for state-of-the-art acceleration.
Ditching the 2D latent grid unlocks 60%+ bitrate reductions in generative video compression by encoding videos into adaptable 1D latent tokens.
Achieve 50% bitrate savings in ultra-low-bitrate image compression by cleverly turning image decoding into a next-frame prediction problem using video diffusion priors.
FPGAs can beat GPUs at dynamically allocating computation for LLM inference, thanks to a new architecture that fuses operations, uses mixed precision, and caches KV values on-chip.
Hybrid Mamba-Transformer models can get 4x faster time to first token and 1.4x higher throughput by disaggregating prefill and decode phases onto specialized accelerator packages.
Unified multimodal models secretly contain separate inference pathways for generation and understanding, and FlashU unlocks this hidden potential for 2x speedup without retraining.
Stop wasting compute: Sharing KV caches across tasks and time can make Vision-Language-Action models run 3.7x faster.
Achieve up to 1.75x faster language model inference by swapping the standard classification head with FlashHead, a training-free retrieval-based alternative.
CacheLib, a popular caching engine, buckles under dynamic multi-tenant workloads, revealing critical limitations in adaptability and fairness that demand a rethink of its design.
Achieve 330x energy reduction in spiking neural networks by adaptively exiting computation based on input complexity using reinforcement learning.