100 papers published across 7 labs.
Multi-agent LLM systems can maintain sub-4-second response times even under classroom-scale concurrency, but only with the right throughput tier.
Signal processing practitioners gain a coherent roadmap for deploying sequential Gaussian Processes in real-world systems, bridging the gap between ML advances and practical application.
Token-aware clustering and hierarchical indexing can slash retrieval latency by an order of magnitude without sacrificing accuracy, making multivector retrieval practical at scale.
Forget chasing bigger GPUs – the future of AI inference could be literally baked into the hardware itself, unlocking 1000x gains in energy and speed.
Unlocking the energy-latency frontier reveals how much cheaper and greener AI inference could be if we strategically relocate computation based on latency tolerance.
LLMs can generate recommendations up to 3.1x faster by explicitly modeling token position within items and speculation depth during speculative decoding.
EdgeFM delivers production-grade VLM/LLM inference performance on edge devices, outperforming vendor-specific toolchains by up to 49% while remaining open-source and cross-platform.
Achieve up to 2.5x faster video object removal by focusing DiT computations only on the essential tokens dictated by the mask.
NeuroRing achieves faster-than-real-time execution of a full-scale cortical microcircuit simulation on FPGAs, proving that scalable, energy-efficient SNN hardware is within reach.
HBM-PIM can achieve impressive matrix multiplication throughput (14.9 GFLOP/s) using a novel reduction-free outer-product dataflow, even without native reduction support.
Recovering type information from untyped GPU register files is the key to enabling effective binary analysis, unlocking reverse engineering and security analysis of proprietary GPU code.
Forget waiting – this new CIM architecture slashes LLM weight update latency by up to 87%, unlocking faster prefill and decoding.
Ternary LLMs can achieve impressive throughput and energy efficiency on edge devices, thanks to VitaLLM's co-designed hardware acceleration that overcomes workload imbalance and data dependency challenges.
Unlock bandwidth-adaptive point cloud transmission with TAFA-GSGC, a single-model codec that delivers up to 9 quality levels from a single bitstream.
Red-teaming long-context LLMs just got a whole lot cheaper: FlashRT slashes the compute and memory costs of prompt injection attacks by up to 7x.
Forget storing full task-specific models – Auto-FlexSwitch compresses the knowledge into tiny, dynamically assembled task vectors, slashing storage costs without sacrificing accuracy.
Juggling high-priority and low-priority ML inference requests on GPUs? Strait delivers up to 11% fewer missed deadlines for critical tasks.
Combining diverse AI prediction tools as a Mixture of Experts slashes variance in semi-supervised inference, outperforming standard Prediction-Powered Inference.
Ditch the training data: this method uses a pre-trained diffusion model to jointly compress and transmit images, outperforming classic techniques without any task-specific training.
Get 4x-10x smaller LoRA models for free with a simple post-processing step that doesn't hurt performance.
You can now get real-time (825 FPS) crack detection on UAVs without sacrificing accuracy, thanks to a new attention-enhanced lightweight CNN.
LLM training bottlenecks? ZipCCL achieves up to 1.18x end-to-end speedups by losslessly compressing communication collectives, without sacrificing model quality.
LLMs can edit code 30% faster and cheaper without sacrificing accuracy, simply by learning to choose between generating full code and structure-aware diffs.
Slash MoE serving costs by two-thirds with FaaSMoE, a serverless architecture that dynamically scales experts on demand.
Stop recomputing the same quantum circuits: a semantic cache slashes redundant simulations by up to 92% and speeds up real quantum hardware by 11x.
Achieve faster VLM inference in bandwidth-constrained edge environments by adaptively compressing visual data, outperforming full-edge and full-cloud solutions without sacrificing semantic accuracy.
Dense matrix multiplication accelerators can surprisingly outperform dedicated sparse accelerators for sparse neural networks, offering better area and energy efficiency.
Skewed item distributions in recommendation systems can be tamed with a learnable non-uniform quantization, leading to better codebook utilization and more accurate generative recommendations.
Forget grid search: LLMs can rapidly find energy-efficient inference parameters, outperforming traditional optimization methods with just a few human-guided prompts.
Stop sweating LLM migrations: this Bayesian framework lets you confidently swap models in production, even with limited human evals.
Shrinking diffusion LLMs by distilling across different architectures can yield surprisingly strong performance, even boosting code generation scores by 16 points on HumanEval.
Forget coarse sequence-level hacks: LenVM lets you precisely dial in token generation length, boosting a 7B model's length accuracy from 30.9 to 64.8 and crushing closed-source rivals.
By co-evolving experts through bidirectional policy distillation, CoPD achieves all-in-one integration of text, image, and video reasoning, outperforming domain-specific experts and suggesting a new training paradigm.
Frontier models are wasted on routine GUI tasks: a step-level cascade slashes compute costs without sacrificing performance by invoking stronger models only when lightweight monitors detect progress stalls or semantic drift.
Speculative decoding, typically used post-RL, can be integrated directly into RL training loops to accelerate LLM rollout generation by up to 2.5x.
Quantization crushes large object detection models for edge deployment, but knowledge distillation can resurrect them, even surpassing their original floating-point precision in a much smaller package.
Forget brute-force scaling: smarter tile and tensor mapping on 3D-stacked chips could unlock massive LLM inference gains.
Edge LLM inference gets a serious speed boost: DUAL-BLADE's dual-path KV cache slashes latency by up to 42% and doubles SSD utilization.
Squeezing high-accuracy LLMs and VLMs onto client devices is now significantly more feasible, thanks to a new pipelined sharding technique that achieves up to 30x speedups and 10x VRAM reduction.
Naive RAPL-based energy monitoring can add nearly 50% overhead to your measurements, but optimized tools can keep it negligible.
Semantic priors in neural speech codecs hit a wall: their benefits plateau beyond 6 kbps, revealing a fundamental limit to improving intelligibility at higher bitrates.
Forget slow reranking: this new method compresses documents into embeddings, letting an 8B parameter model run up to 18x faster than smaller models with better accuracy.
SLMs can match the reasoning performance of much larger models by simply re-ranking their own top-K token predictions, eliminating the need for expensive LLM calls at inference time.
Task-specific LLMs aren't just smaller versions of general models; they rely on a small subset of neurons so critical that removing just 10% can completely break them.
Your ECC implementation might be leaking secrets through power consumption differences between multiplication and squaring, regardless of your multiplication algorithm.
Dynamic quantization, a widely adopted optimization for efficient ML serving, can leak your data to adversaries sharing the same batch.
Smaller models get a bigger speed boost from Speculative Decoding on software engineering tasks, challenging the assumption that larger models always benefit more from inference acceleration techniques.
Achieve 34x compression of 3D Gaussian Splatting models *without* sacrificing rendering quality, and sometimes even improving it.
Black-box knowledge distillation can be significantly improved by synthesizing diverse image priors and using contrastive learning to enhance the distinctions between synthetic samples.
Slash your LLM's carbon footprint by up to 81% without sacrificing performance using a compression pipeline inspired by carbon taxation.
Augmenting few-shot knowledge distillation with adaptively selected, teacher-confident GAN-generated images dramatically boosts student accuracy.
Even when you think you're only teaching a model what *not* to do, sustained gradient alignment can lead to the unintended acquisition of undesirable traits.
SignSGD can beat Adam and even SGD with a few simple tweaks, proving that 1-bit quantization doesn't have to mean sacrificing accuracy.
Stop wasting bandwidth on irrelevant tokens: Fed-FSTQ uses Fisher information to selectively quantize and transmit only the most important tokens, slashing communication costs in federated LLM fine-tuning by up to 46x.
Integer-only attention is now a viable alternative to floating-point, delivering up to 8.69x speedups and 18.8% energy reduction on Vision Transformers.
Rule extraction from tree ensembles just got 22x faster, without sacrificing accuracy or interpretability.
Unstructured pruning isn't just about shrinking LLMs; it can actually *boost* their reasoning abilities during test-time scaling, outperforming even the full, unpruned models.
Distilling large models into smaller ones can silently sacrifice crucial capabilities like safety and uncertainty awareness, even if headline metrics stay the same.
Compound AI systems can achieve nearly 4x throughput improvement and cut tail latency in half with a modular, autoscaling inference architecture.
Forget fancy distillation losses: simple feature-based knowledge distillation, given enough compute, lets a ResNet-18 student nearly match a ResNet-101 teacher in semantic segmentation.
By injecting basic physics, this method achieves up to 9% accuracy gains in human activity recognition, proving that inductive biases still matter for real-world sensor data.
LLMs from different vendors and sizes secretly speak the same statistical language, enabling a blazing-fast, model-agnostic output verification method.
SNNs can achieve higher accuracy and lower latency by learning the optimal spiking resolution for each layer, rather than relying on predefined burst structures.
By intelligently pruning tokens based on spike timing and activation, Vision SmolMamba achieves state-of-the-art efficiency in spiking neural networks, outperforming even Spiking Mamba.
Tensor networks offer a surprisingly robust and efficient alternative to traditional neural networks for classifying noisy SAR imagery, even under data poisoning attacks.
Diagnose more, charge less: a new VCE pipeline slashes energy consumption by 40% by intelligently skipping bubble-filled frames without sacrificing diagnostic quality.
Overcome the bandwidth bottleneck in remote sensing with a collaborative edge-cloud approach that transmits structural priors, enabling high-fidelity super-resolution and boosting downstream perception tasks even under extreme compression.
Prioritizing tiny objects on edge devices isn't just about detector accuracy; DenseScout shows that a lightweight, dense-response selector coupled with transport-aware runtime can drastically outperform traditional detectors under strict compute and latency budgets.
Federated LLM inference gets a speed boost: SpecFed's speculative decoding and compressed communication slash latency without sacrificing generation quality.
FusionCIM slashes LLM inference energy costs by nearly 4x while doubling processing speed, setting a new benchmark for efficiency in AI hardware.
Achieve up to 70% faster rendering by distilling XGBoost models into lookup tables that adapt rendering parameters on a per-frame basis with sub-millisecond latency.
Unlock the full potential of your pretrained video diffusion models with a surprisingly simple four-stage post-training framework that drastically improves visual quality, temporal coherence, and instruction following.
Forget prefetching: DAK unlocks up to 3x faster LLM inference by enabling direct GPU access to remote memory, achieving near-optimal system bandwidth utilization.
Multi-agent LLM systems are leaving performance on the table by treating structured agent interactions as generic traffic; Pythia shows how to unlock substantial gains by exploiting workflow semantics at the serving layer.
Stop leaving 10-70% of your MoE kernel throughput on the table: RaMP dynamically optimizes kernel configuration based on runtime expert routing, achieving up to 1.41x end-to-end speedup in vLLM serving.
Mobile LLM inference just got a whole lot faster: AHASD achieves up to 4.2x throughput and 5.6x energy efficiency gains by intelligently decoupling and managing drafting and verification tasks on a PIM-NPU architecture.
On-device cardiac monitoring is now feasible on ultra-low-power wearables, achieving 98% accuracy at just 8.55 mW.
LUT-based hardware architectures can achieve up to 2.2x area reduction for LLM inference by challenging conventional design assumptions and optimizing for activation data types.
Forget GPUs – NVLLM's 3D NAND-centric design slashes LLM inference latency by up to 37.9x on edge devices, making on-device LLMs a real possibility.
RecFlash slashes recommendation inference latency by up to 81% and energy consumption by nearly 92% through smart data remapping in NAND flash memory.
WhisperPipe achieves 3-5x lower latency than existing streaming ASR solutions while consuming significantly less memory, making it a game-changer for real-time applications.
Forget GPU-centric designs: AMMA slashes attention latency by 15x and energy consumption by 7x with a memory-centric architecture for long-context LLMs.
TetrisG-SDK achieves up to 1.3x faster convolutional layer processing while slashing energy consumption by over 70% in some cases.
CacheFlow slashes LLM serving latency by up to 62% by rethinking KV cache restoration as a 3D-parallel scheduling problem, not just a recompute vs. I/O tradeoff.
Unlock more diverse and effective LLM outputs by explicitly rewarding semantic novelty during decoding with Exploratory Sampling.
Standard black-box optimization falls apart when deploying ML models under tight constraints in crash-prone environments; TBA offers a robust, feasible-first alternative that actually works.
Finally, a practical biometric authentication system offers provable security against large-scale data breaches without sacrificing scalability or requiring auxiliary identifiers.
Spark Policy Toolkit unlocks scalable policy learning in Spark by guaranteeing consistent results even with distributed execution, finally making it possible to apply complex policy learning techniques to large datasets.
Squeeze your LLM inference costs: PolyKV slashes KV cache memory by up to 97% using a shared, compressed pool, with negligible impact on quality.
The secret to effectively pruning LLMs might not be *how* you search for redundant layers, but *what* you're optimizing for.
Edge devices can now achieve up to 494x faster certified robustness with Laplace-Bridged Smoothing, making formally verified AI deployments practical in resource-constrained settings.
Training LLMs to explicitly optimize for how they're *actually* used at inference time unlocks substantial performance gains compared to standard fine-tuning.
Small language models can achieve reasoning performance rivaling larger models, even under tight token budgets, by using a lightweight "guidance track" to strategically prune and refine their chain-of-thought reasoning.
Not all layers are created equal: pruning the KV cache in a layer-dependent manner significantly boosts long-context LLM performance compared to uniform pruning strategies.
On-device SLMs in mobile apps demand a radical shift: the less the model does, the more reliable it becomes.
Quantum-safe certificates bloat TLS handshakes so much that they measurably degrade web performance, and current CDN optimizations aren't enough to fully compensate.
Forget complex side-channel analysis: a single, machine-checked theorem proves that masked Barrett reduction leaks at most *one bit* of information per wire, offering a universal security guarantee for post-quantum crypto.
Backdoor attacks in LLMs can be defused at inference time, without retraining or external data, by geometrically smoothing attention patterns to disrupt adversarial routing.
Frequency domain analysis unlocks 1.59x speedups in Vision-Language-Navigation by enabling optimal token caching, a feat previously limited by visual domain approaches.
Edge NPUs can outperform flagship GPUs in cost and energy efficiency for on-robot VLA model deployment, but only with hardware-aware optimizations that tackle the models' distinct compute and memory-bound phases.