100 papers published across 4 labs.
Achieve LSTM acceleration on embedded FPGAs with 11.89 GOP/s/W energy efficiency by tuning architectural parameters.
Multi-agent LLM systems can maintain sub-4-second response times even under classroom-scale concurrency, but only with the right throughput tier.
Laplacian DP and adaptive quantization can slash federated learning communication costs by over 50% without sacrificing accuracy or privacy, even with non-IID data.
Unlock more diverse and effective LLM outputs by explicitly rewarding semantic novelty during decoding with Exploratory Sampling.
Standard black-box optimization falls apart when deploying ML models under tight constraints in crash-prone environments; TBA offers a robust, feasible-first alternative that actually works.
Finally, a practical biometric authentication system offers provable security against large-scale data breaches without sacrificing scalability or requiring auxiliary identifiers.
Spark Policy Toolkit unlocks scalable policy learning in Spark by guaranteeing consistent results even with distributed execution, finally making it possible to apply complex policy learning techniques to large datasets.
Squeeze your LLM inference costs: PolyKV slashes KV cache memory by up to 97% using a shared, compressed pool, with negligible impact on quality.
The secret to effectively pruning LLMs might not be *how* you search for redundant layers, but *what* you're optimizing for.
Edge devices can now achieve up to 494x faster certified robustness with Laplace-Bridged Smoothing, making formally verified AI deployments practical in resource-constrained settings.
Training LLMs to explicitly optimize for how they're *actually* used at inference time unlocks substantial performance gains compared to standard fine-tuning.
Small language models can achieve reasoning performance rivaling larger models, even under tight token budgets, by using a lightweight "guidance track" to strategically prune and refine their chain-of-thought reasoning.
Not all layers are created equal: pruning the KV cache in a layer-dependent manner significantly boosts long-context LLM performance compared to uniform pruning strategies.
On-device SLMs in mobile apps demand a radical shift: the less the LLM does, the more reliable it becomes.
Quantum-safe certificates bloat TLS handshakes so much that they measurably degrade web performance, and current CDN optimizations aren't enough to fully compensate.
Forget complex side-channel analysis: a single, machine-checked theorem proves that masked Barrett reduction leaks at most *one bit* of information per wire, offering a universal security guarantee for post-quantum crypto.
Backdoor attacks in LLMs can be defused at inference time, without retraining or external data, by geometrically smoothing attention patterns to disrupt adversarial routing.
Frequency domain analysis unlocks 1.59x speedups in Vision-Language-Navigation by enabling optimal token caching, a feat previously limited by visual domain approaches.
Edge NPUs can outperform flagship GPUs in cost and energy efficiency for on-robot VLA model deployment, but only with hardware-aware optimizations that tackle the models' distinct compute and memory-bound phases.
Compiling and executing YOLO-NAS on an FPGA-based accelerator is now possible, opening doors for real-time object detection in safety-critical applications like aeronautics.
Forget A100s for long-context LLMs – Salca achieves up to 74x better energy efficiency with a sparsity-aware hardware accelerator.
Vanilla on-policy distillation falls apart in multi-turn settings due to compounding errors, but a simple curriculum on trajectory length fixes it, even letting students beat their teachers.
Shrinking massive audio foundation models by up to 61x is now possible without significant performance loss, thanks to a novel self-supervised distillation approach that works directly on embeddings.
Squeezing intermediate tensors with FP8 quantization and adaptive transforms can nearly double the throughput of tensor-parallel LLM training without sacrificing accuracy.
Scale up your nearest neighbor search without blowing your budget: this work shows how to use Dask to parallelize Product Quantization and Inverted Indexing, achieving accuracy comparable to single-machine methods on much larger datasets.
VARestorer distills a text-to-image VAR model into a one-step super-resolution network, achieving state-of-the-art image quality with a 10x speedup.
Forget compressing entire tokens – selectively routing *parts* of tokens based on query relevance unlocks better compression-quality tradeoffs in LoRA-adapted transformers.
Halving the parameter count of LLMs without sacrificing performance is now possible with Hyperloop Transformers, thanks to looped layers and hyper-connected residual streams.
Autoregressive video diffusion models can achieve faster decoding, lower memory footprint, and higher quality long-horizon generations by learning to attend to only the most salient spatiotemporal blocks.
Recurrent Transformers let you trade model depth for width, slashing KV cache memory footprint and inference latency without sacrificing performance.
LLM agents are wasting up to 60k tokens per turn on unnecessary tool schemas – Tool Attention slashes this "Tools Tax" by 95% and unlocks truly scalable agentic workflows.
Achieve high-fidelity image enhancement on mobile devices even after quantization by training a model that anticipates and adapts to low-precision representations.
Forget flat numerical compression – GS-Quant unlocks better knowledge graph completion by generating discrete codes that mirror the hierarchical nature of human reasoning.
LLMs can be both faster and smarter: pre-learned reasoning skills cut down token usage while boosting accuracy on coding and math problems.
Achieve competitive video copy detection accuracy with descriptors orders of magnitude smaller and inference speeds exceeding 11k samples per second by replacing floating-point operations with a learned Boolean circuit.
Get LLM-boosted recommendations without the LLM latency: this distillation method lets you bake rich user profiles into efficient sequential recommenders.
Deploying language models in the Global South requires bridging the gap between multilingual NLP and edge computing, two fields that have largely evolved independently despite their shared goals.
LLM agent distillation leads to surprisingly high rates of behavioral mimicry, with some student models exhibiting tool-use habits *more* similar to their teacher's than those of the teacher's own family members.
Current 3D Gaussian Splatting methods are too unpredictable for real-world use, but YOGO makes them deterministic and production-ready.
Ditch the cache: Prototype-Based Test-Time Adaptation (PTA) boosts vision-language model accuracy by nearly 4% while *doubling* inference speed compared to existing cache-based methods.
Edge devices can now learn continuously from visual data with 40x faster speed and 380x better energy efficiency, thanks to a novel FPGA accelerator design.
SIMD parallelism can finally unlock substantial speedups in large-number arithmetic by rethinking algorithms around data-parallel operations, yielding up to 19.3% throughput gains in scientific computing.
Reduce deadline misses and server switching by explicitly accounting for tail risk and stability in edge server selection.
A server-driven adaptive sampling approach slashes power consumption in wireless iBCIs by 40 mW while *improving* decoding accuracy.
On-device LLM inference gets a massive speed and energy boost by adaptively streaming only the most expensive parts of the KV cache from the cloud.
Forget simple offloading – this framework intelligently decomposes LLM tasks across devices and edge servers, slashing latency and boosting rewards in congested WiFi networks.
By dynamically injecting frequency-aware n-gram features, X-GRAM achieves state-of-the-art accuracy with smaller embedding tables, offering a practical path to scaling memory-augmented architectures.
Clock skew as small as 5 ms can break causality in observability data from distributed AI inference systems, even when the system is working perfectly.
SpanDec achieves state-of-the-art NER accuracy with significantly improved throughput, proving that you don't need to exhaustively process every possible span to achieve top performance.
Exact attention over billion-token sequences is now possible on a single GPU, thanks to a novel streaming approach that avoids out-of-memory errors without approximation.
Differentiable landmark selection for shortest-path heuristics can provably preserve admissibility, achieving near-optimal coverage and faster query times compared to traditional methods.
Forget pruning by variance: high-variance activations in transformers are surprisingly uncorrelated with predictive power.
Optimizing AI inference can boost throughput and reduce latency, revealing strategies that enhance performance under real-world traffic conditions.
Leaking user queries through disk access patterns in sensitive ANN search? Onyx flips the script on prior work to achieve up to 9.9x cost reduction and 12.3x latency improvement.
Forgetting isn't a bug, it's a feature: selectively pruning LLM agent memories boosts efficiency by 8%, sharpens content quality by 29%, and eliminates security risks entirely.
Get calibrated anomaly detection from time series foundation models without any fine-tuning, even when the data distribution shifts.
Diffusion language models withstand aggressive quantization better than autoregressive models, suggesting a path to efficient deployment.
Reasoning across languages doesn't have to break the bank: a new framework slashes token costs by over 50% while maintaining accuracy, especially boosting performance in low-resource languages.
Machine-checked proofs now guarantee the security of arithmetic masking in NTT pipelines, but watch out: even a single lapse in "fresh masking" can expose vulnerabilities, as seen in the Adams Bridge accelerator.
LLMs can bootstrap accurate and efficient log parsing by synthesizing regex masks, enabling a hybrid approach that outperforms both heuristic and LLM-only methods.
Achieve 2.6x faster autoregressive world model inference without retraining by caching and selectively reusing block-level residuals across generation chunks.
Distilling knowledge from a Mamba-based teacher network significantly boosts the performance of quantized INT8 super-resolution models, enabling high-quality image enhancement on resource-constrained mobile devices.
Achieve near-perfect (96.35% Dice) maxillary sinus segmentation from X-rays with limited labeled data by distilling knowledge from GAN-refined pseudo-labels.
Achieving stable bitrate tracking in learned video compression can reduce average bitrate errors to as low as 2.13%, transforming how we manage video quality under constraints.
Fine-grained management of speculative decoding phases can boost LLM serving throughput by over 50% and cut latency nearly in half.
Forget hours-long simulations: EnergAIzer slashes GPU power estimation time to seconds while maintaining accuracy, by exploiting structured patterns in AI kernel optimizations.
Stacking SRAM cells slashes leakage power without adding transistors.
Deterministic decoding can outperform stochastic self-consistency in constrained domains by systematically exploring high-probability reasoning traces, leading to better performance with less computation.
Speed up your RAG pipelines by up to 37% without sacrificing accuracy by speculatively retrieving documents based on query homology.
Flipping the script on RowHammer defense, PVAC counts activations on victim rows instead of aggressors, slashing false positives and boosting performance.
Hybrid Policy Distillation achieves superior performance by harmonizing the strengths of forward and reverse KL divergence, transforming the landscape of knowledge distillation for LLMs.
Get calibrated uncertainty estimates from your scientific foundation models in minutes, not days, with this simple attention randomization trick.
Naive attention-based filtering for edge-cloud inference is suboptimal under tight bandwidth constraints; prioritizing semantic diversity in transmitted embeddings yields surprisingly large accuracy gains.
TurboQuant's claimed advantages over RaBitQ in quantization don't hold up under rigorous, reproducible comparison, raising questions about its practical utility.
Compact, gradient-free MARS models can now outperform state-of-the-art gradient-based sequence models like Mamba, while slashing training times from hours to milliseconds.
LLMs can be aggressively quantized to W(1+1)A4 without significant performance degradation using a surprisingly simple three-stage distillation approach.
Forget fancy quantization schemes – a simple token-wise INT4 quantization with Hadamard rotation is all you need to nearly match FP16 accuracy in LLM serving, without sacrificing throughput.
Ditch the slow lane: $R^2$-dLLM turbocharges diffusion language models by slashing decoding steps by up to 75% without sacrificing quality.
Forget noisy samples: RL can now directly optimize the *gradients* of diffusion distillation, leading to SOTA few-step image generation.
You can now dial a knob to make your LLM either super-distillable or completely un-distillable, opening up new possibilities for both efficient knowledge transfer and robust model protection.
Achieve 50% parameter reduction in LLaMA-2-7B with minimal performance loss and no fine-tuning, thanks to a new global gating-based structured pruning method.
Similarity alone is a poor guide for LLM depth pruning: jointly considering representational similarity *and* transformation difference unlocks significantly better compression.
Forget scaling laws: strategically equipping small language models with tools delivers a better performance/cost tradeoff than simply scaling up or deploying multi-agent systems.
Achieve near-lossless performance in autonomous driving VLMs with 90% token reduction – without any training.
Multi-modal models can now better handle distribution shifts thanks to a new method that explicitly models how different categories are distributed, even when the modalities are asymmetrical.
Attention's quadratic complexity is no longer a bottleneck: DASH-KV achieves linear O(N) inference without sacrificing accuracy by reformulating attention as an approximate nearest-neighbor search.
Forget chasing the biggest LLM – this benchmark reveals that smaller models (<2B params) can deliver 3x better energy efficiency and faster ROI in real-world industry deployments.
Ditch the slow "think-first-then-translate" paradigm: ReflectMT internalizes reflection, delivering faster and better machine translation in a single pass.
Federated learning can be sped up by 74% without sacrificing security, thanks to a novel hardware-assisted approach that cleverly decouples cryptographic setup from the active training phase.
Neural networks made of logic gates can now be directly compiled to silicon, achieving impressive MNIST classification speeds with low power consumption.
Achieve state-of-the-art small object detection in high-resolution imagery while slashing inference time by 20-25% using adaptive slicing.
The LPCVC 2025 winning solutions showcase surprisingly effective strategies for balancing accuracy and efficiency in edge-based computer vision, pushing the boundaries of what's possible on resource-constrained devices.
Achieve over an order of magnitude speedup in 3D Gaussian Splatting by adaptively scaling Gaussians based on their color contribution, without sacrificing visual fidelity.
Instant AI assistants are now feasible on smartwatches: 8M-parameter models can kickstart responses locally, hiding cloud latency with surprisingly high quality.
Test-time training can finally scale for large reasoning models: TEMPO unlocks sustained performance gains by interleaving policy refinement with periodic critic recalibration, boosting accuracy by over 18% on challenging benchmarks.
LLMs break in two fundamentally different ways when pushed to extreme quantization: either through gradual information loss or sudden functional breakdown of key components.
Unlock privacy-preserving eye-tracking analysis with garbled circuits, enabling secure scanpath comparison without revealing sensitive gaze data.
Training 3D Gaussian Splatting models on edge devices is now practical: this method slashes peak memory consumption by 80% without sacrificing visual quality.
Lightweight UAV detectors get a surprisingly large boost in accuracy and robustness from a carefully tuned Mosaic and HSV augmentation pipeline, outperforming more complex methods.
Ditch bulky SAR image reconstruction: this online edge-mapping technique slashes memory and compute costs for UAV-based target recognition.