100 papers published across 4 labs.
LLMs can reason more efficiently by triaging queries and applying deep thought only when truly needed, thanks to a new coarse-to-fine inference framework.
By intelligently suppressing boundary outliers before quantization, BS-KMQ slashes quantization error by 3x and boosts energy efficiency by 24x in in-memory computing.
Exploit the surprisingly stable, yet heterogeneous, sparsity patterns across attention heads to slash LLM attention latency by 2.88x without sacrificing quality.
Iteratively refining target speaker extraction *without* retraining a model unlocks significant performance gains, offering a flexible and efficient approach to speech separation.
Achieve up to 1.28x faster VLA model inference for robotic manipulation without retraining, simply by merging visual tokens based on depth.
Stop wasting compute on easy and impossible examples: PACED distillation focuses your student model's training on the sweet spot where it actually learns.
Forget slow FP64: this work unlocks efficient double-precision matrix multiplication on modern GPUs by adapting the Ozaki-II scheme to run on faster FP8 hardware.
LLM-based ASR can be sped up by 4.4x with minimal accuracy loss by using a CTC encoder to speculatively generate draft transcriptions.
Diffusion Transformers can be accelerated by up to 7x with nearly lossless performance using a training-free method that selectively computes on sparse anchor tokens, outperforming existing temporal acceleration techniques.
Achieve up to 12x greater sample efficiency in reasoning tasks by relaxing strict imitation constraints in on-policy distillation, enabling smaller models to match the performance of much larger ones.
Forget subjective human evaluations: this paper uses a clever knowledge distillation trick to objectively rank XAI methods for NMT, revealing that attention-based attributions beat gradient-based ones.
Subtracting the mean from activations unlocks stable FP4 training for LLMs, closing the performance gap with BF16 without complex spectral methods.
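A minimal sketch of the centering idea in isolation, assuming signed int4 as a stand-in for a true FP4 format; the `quantize_centered_4bit` helper is hypothetical and not the paper's code:

```python
import numpy as np

def quantize_centered_4bit(x: np.ndarray):
    """Hypothetical sketch: subtract the mean, then map onto a signed 4-bit
    integer grid (a stand-in for a real FP4 format)."""
    mu = x.mean()
    centered = x - mu                                  # remove the shared offset
    scale = max(np.abs(centered).max() / 7.0, 1e-8)    # spend the grid on the spread
    q = np.clip(np.round(centered / scale), -8, 7).astype(np.int8)
    return q, scale, mu

def dequantize(q, scale, mu):
    return q.astype(np.float32) * scale + mu

x = np.random.randn(4, 8).astype(np.float32) + 3.0     # activations with a large mean
q, s, mu = quantize_centered_4bit(x)
print("max abs error:", np.abs(dequantize(q, s, mu) - x).max())
```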
Maximize your LLM's goodput without diving into its internals: a new black-box controller uses hill climbing on end-to-end measurements to optimize performance.
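As a generic illustration of black-box hill climbing on an end-to-end metric; the knobs, step rule, and `measure_goodput` hook below are hypothetical stand-ins, not the controller described in the paper:

```python
import random

def hill_climb(measure_goodput, knobs, steps=50):
    """Black-box tuning loop: perturb one knob at a time and keep the change
    only if the end-to-end measurement improves."""
    best = dict(knobs)
    best_score = measure_goodput(best)
    for _ in range(steps):
        name = random.choice(list(best))
        candidate = dict(best)
        candidate[name] = max(1, candidate[name] + random.choice([-1, 1]))
        score = measure_goodput(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Toy stand-in for a real serving stack: goodput peaks at batch=32, chunk=8.
toy_goodput = lambda k: -(k["batch"] - 32) ** 2 - (k["chunk"] - 8) ** 2
print(hill_climb(toy_goodput, {"batch": 4, "chunk": 1}, steps=500))
```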
Get faster long-context LLM inference without sacrificing accuracy: LookaheadKV predicts KV cache importance, outperforming costly draft generation methods by 14.5x.
Multi-robot systems can slash battery consumption by 15% and boost GPU utilization by 50% for large DNN inference by using a hybrid offline-online reinforcement learning strategy to dynamically schedule and distribute DNN module execution.
Accuracy leaderboards mislead: lightweight classical anomaly detectors surprisingly outperform deep methods when deployed under the throughput constraints of in-vehicle monitoring systems.
Secure multi-tenant LLM serving without sacrificing performance is now possible: CacheSolidarity selectively isolates prefixes, boosting cache reuse by up to 70% and cutting inference latency by 30% compared to blunt-force defenses.
Quantifying the overhead of post-quantum cryptography reveals exactly where the performance bottlenecks lie in real-world TLS 1.3 transactions.
Encoder-only multi-talker ASR can now rival LLM-based systems in accuracy while drastically reducing computational cost, thanks to a novel distillation approach and talker-count routing.
Stop neural network model theft: bind your models to specific hardware using PUFs, rendering them useless on clones.
Monocular depth estimation can now run at 161 FPS on edge devices without sacrificing too much accuracy, thanks to a clever asynchronous architecture that reuses features from a foundation model.
A pipelined FPGA architecture slashes the power consumption of JPEG XS's Intra Pattern Copy displacement vector search, enabling practical hardware deployment for low-latency image compression.
Ditch the slow diffusion grind: Marigold-SSD delivers zero-shot depth completion in a single step, rivaling discriminative models in speed while retaining diffusion's accuracy.
Vision-language models can significantly enhance language models through knowledge distillation, even without direct textual understanding, challenging conventional KD paradigms.
AgentServe achieves up to 2.8x improvement in time-to-first-token and 2.7x in time-per-output-token for agentic workloads on a single GPU by strategically isolating prefills and decodes.
Forget fixed decoding parameters: this RL-trained adapter dynamically adjusts LLM sampling strategies at inference, boosting accuracy by up to 10% under tight compute budgets.
Humanoid robots can now walk robustly in the real world using only onboard sensors, thanks to a new diffusion policy that implicitly learns state estimation.
Unlock calibrated uncertainty in Mixture-of-Experts Transformers with VMoER, a Bayesian routing method that slashes calibration error by 94% while barely impacting FLOPs.
DendroNNs offer a 4x energy efficiency boost over existing neuromorphic hardware by mimicking dendritic computation and training via a gradient-free rewiring mechanism.
On-device LLM inference can be sped up by an order of magnitude with a flexible TrustZone-based system that selectively protects memory and the NPU.
A Goldilocks zone exists for neural audio codec quantization depth, where intermediate levels strike the best balance between suppressing adversarial noise and preserving speech content for robust ASR.
ZipPIR delivers SimplePIR-level throughput without the massive client-side storage, finally making high-performance private information retrieval practical for resource-constrained devices.
On-device LLM inference with PIM is now more practical: PIM-SHERPA resolves memory inconsistencies, slashing memory capacity needs by ~50% without sacrificing performance.
By strategically increasing hash collisions, Nemo slashes write amplification in flash caches for tiny objects, a persistent bottleneck even with advanced SSDs.
BinaryAttention proves you can more than halve the runtime of attention in vision and diffusion transformers without sacrificing accuracy, simply by using the sign of queries and keys.
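A rough sketch of the headline trick, attention scores from sign(Q) @ sign(K)^T; real kernels would pack the signs into bits and use popcount, so treat this purely as an illustration:

```python
import numpy as np

def sign_attention(Q, K, V):
    """Attention scores from sign(Q) @ sign(K)^T instead of full-precision QK^T.
    NumPy for clarity; not an optimized or official implementation."""
    d = Q.shape[-1]
    scores = np.sign(Q) @ np.sign(K).T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)        # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q, K, V = (np.random.randn(16, 64) for _ in range(3))
print(sign_attention(Q, K, V).shape)   # (16, 64)
```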
Stream 3D Gaussian Splatting scenes with higher visual quality and lower bandwidth by predicting user viewpoints and dynamically adapting bitrate using deep reinforcement learning.
Achieve up to 7.24% code-size reduction by identifying and extracting idempotent backward slices, enabling the merging of non-contiguous instruction sequences within and across functions.
Achieve near-FP32 image restoration performance with an Int8 model that runs at 442 FPS on NVIDIA Jetson Orin, all thanks to a quantization-aware distillation framework that avoids decoder distillation.
Event cameras can now estimate depth with significantly improved temporal consistency and accuracy thanks to a novel distillation approach from video foundation models, achieving a 53% reduction in depth error.
By recombining subgraphs from sparse models without retraining, "model stitching" creates a diverse set of model variants that significantly improves the efficiency of multi-DNN inference on edge SoCs.
Finally, analog joint source-channel coding can be deployed on standard digital transceivers, unlocking the potential of semantic communication on existing infrastructure.
Get up to 24x faster sine/cosine calculations on ESP32 microcontrollers by dynamically switching between fixed-point and floating-point precision.
IoT devices struggling with weak entropy can now get a cryptographic boost from a RISC-V trusted execution environment, turning entropy provisioning into a manageable service.
Achieve RAG efficiency without sacrificing accuracy: LooComp prunes context by identifying and retaining only the most critical sentences for answering a query.
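A toy illustration of sentence-level context pruning; the word-overlap scorer and the `prune_context` helper are placeholders, not LooComp's actual criterion:

```python
import re
from collections import Counter

def prune_context(query, context, keep=2):
    """Keep the `keep` sentences that overlap most with the query.
    The overlap scorer is a stand-in; a real system would use a trained relevance model."""
    q_words = Counter(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    ranked = sorted(sentences,
                    key=lambda s: sum(q_words[w] for w in re.findall(r"\w+", s.lower())),
                    reverse=True)
    kept = set(ranked[:keep])
    return " ".join(s for s in sentences if s in kept)   # preserve original order

print(prune_context("Who designed the bridge?",
                    "The bridge opened in 1932. It was designed by John Roebling. "
                    "Tolls were abolished in 1970."))
```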
Achieve 45x compression of 3D Gaussian Splatting data while *improving* visual fidelity by over 10% with a streaming-friendly octree-based codec.
Achieve higher accuracy and faster convergence in split learning by intelligently pruning communication channels based on label awareness.
Achieve comparable speech restoration quality with conditional diffusion models using 10x fewer neural network evaluations via a novel iSDE solver.
Ditch slow, iterative ODE solvers for robot control: this method distills flow-based policies into a single-step model that's fast enough for real-time replanning without sacrificing multi-modal action diversity.
Forget ensembling or retraining: model merging lets you Frankenstein LLMs for specialized skills at minimal cost.
Forget confidence scores: a modality-aware early exit strategy for spoken language models slashes decoding costs without sacrificing accuracy or perceptual quality, revealing that speech tokens require specialized handling compared to text.
Achieve up to two orders of magnitude reduction in semantic communication rate by strategically incorporating common randomness in a privacy-preserving distributed computation framework.
Token pruning in dense retrieval gets a geometric upgrade: Voronoi cells offer a principled way to shrink your index without sacrificing search quality.
Achieve a 277x speedup in autoregressive video generation by distilling diffusion models with a novel "diagonal distillation" approach that leverages temporal context and mitigates error propagation.
Don't fully retrain your draft model after fine-tuning your LLM: EDA restores speculative decoding performance with significantly less compute by adapting only a small, private component and regenerating training data.
Multi-prototype-guided federated learning overcomes data heterogeneity in edge computing, boosting accuracy and reducing errors compared to single-prototype methods.
VLMs can achieve 7.8x faster prefilling speeds with only a minor accuracy drop by intelligently pruning redundant visual tokens *without* retraining.
On-device fine-tuning of Transformers is now feasible on ultra-low-power, memory-constrained edge devices thanks to TrainDeeploy, which achieves up to 11 trained images per second on a RISC-V SoC.
Mamba-2's efficiency doesn't require custom CUDA kernels: XLA's compiler optimizations are enough to unlock near-optimal performance across diverse hardware.
K-means, previously relegated to offline processing, gets a 17.9x speed boost on modern GPUs thanks to Flash-KMeans' clever IO and contention optimizations.
MDLMs can be sped up by nearly 10x without retraining, simply by focusing computation on the tokens that actually change between denoising steps.
Caching and speculative transcoding can drastically reduce the computational burden of on-the-fly point cloud transcoding, enabling scalable streaming systems.
Text-to-audio diffusion just got a whole lot faster: SoundWeaver slashes latency by up to 3x without retraining, simply by cleverly reusing similar audio samples.
Squeezing 11x more performance from your datacenter GPUs is now possible for compound inference tasks, thanks to JigsawServe's adaptive model selection and fine-grained spatial partitioning.
By framing prior smoothing as a shrinkage process and applying a micro-diffusion denoising layer, Midicoth achieves more accurate probability estimates in lossless compression, even with limited data.
Ditch the stochasticity: Deterministic pruning slashes LLM size with minimal performance loss, outperforming stochastic methods and accelerating inference.
One-step image synthesis can be dramatically improved by focusing on weight *direction* changes during distillation, not just magnitude.
Constraints don't just limit optimization; they warp the very geometry of improvement, revealing hidden ascent directions.
Recovering types from stripped binaries just got a whole lot faster: XTRIDE achieves up to 2300x speedup in struct recovery while maintaining state-of-the-art accuracy.
Slash blockchain bloat by an order of magnitude: AR-ACE ships compact attestations, not bulky validity proofs, through mempool and relay networks.
Slash blockchain transaction sizes by an order of magnitude with ZK-ACE, which replaces bulky post-quantum signatures with succinct, identity-based zero-knowledge proofs.
MoE models, despite their training efficiency, can be structurally 4.5x slower than quality-matched dense models at inference due to memory fragmentation, especially in long-context scenarios.
Language models can beat FLAC for lossless audio compression at 8-bit and 16-bit, but their advantage shrinks at 24-bit, revealing a challenge for high-fidelity audio.
Get 3.6x faster long-context LLM inference with LycheeCluster's hierarchical KV indexing, which avoids the semantic fragmentation of naive chunking.
Overcome memory bottlenecks in drone-based Synthetic Aperture Radar (SAR) imaging with a new online reconstruction method that processes data incrementally.
VLA models get a 1.73x speedup with only 5-7% overhead thanks to RAPID, a new edge-cloud collaborative inference framework that smartly handles visual noise and motion continuity.
Stop wasting compute: CODA dynamically adjusts reasoning depth based on problem difficulty, slashing token costs by 60% on easy tasks while boosting performance on hard ones.
Forget token counting: this work introduces a semantic prior based on surprisal to compress LLM reasoning traces, achieving better accuracy and fluency than heuristic length penalties.
Speech models can now be quantized to INT4 with near-lossless performance thanks to a new evolution strategy-based calibration method tailored for audio activations.
LLMs can be pruned more effectively by considering the information entropy of their output distribution, surpassing the limitations of traditional cross-entropy-based Taylor pruning.
Ditch 25% of your Transformer's attention parameters without sacrificing performance by swapping the dense output projection for a structured Hadamard transform, and watch your throughput climb.
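For intuition, a structured Hadamard transform replaces an O(d^2) dense projection with O(d log d) butterflies; the `fwht` sketch below shows only the transform itself, under the assumption of a power-of-two width, not the paper's integration into attention:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform along the last axis (length must be a power
    of two): O(d log d) butterflies instead of an O(d^2) dense matmul."""
    x = x.copy()
    d = x.shape[-1]
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(d)                  # orthonormal scaling

# Illustrative swap: apply the transform where a dense output projection would sit.
attn_out = np.random.randn(32, 128)        # (tokens, model_dim), power-of-two width
print(fwht(attn_out).shape)                # (32, 128), no learned W_O parameters
```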
Achieve near lossless 40% parameter and FLOPs reduction in large vision transformers like CLIP and DINOv2 without finetuning, thanks to adaptive MLP pruning.
Tree speculative decoding can achieve up to 2.46x speedup on Ascend NPUs, but only if you carefully manage the branch/commit cache and eliminate undefined negative indices.
Forget uncontrolled parameter growth in class incremental learning: GRACE adaptively scales model capacity, achieving state-of-the-art performance with a 73% memory reduction.
LLMs can slash inference costs by 80% without sacrificing accuracy, simply by learning to recognize when their own reasoning is shaky and needs a second opinion.
Credal sets, previously impractical for large models, are now efficiently computable via a "decalibration" method that delivers strong performance in uncertainty-aware tasks.
Squeeze your embodied AI models: DyQ-VLA cuts memory footprint by 70% and speeds up inference by 40% without sacrificing performance, all by dynamically adjusting bit-widths based on real-time kinematic data.
Beat the LLM inference bottleneck: SageSched's uncertainty-aware scheduling boosts efficiency by nearly 30% by predicting output length and balancing compute and memory demands.
Get near-peak performance for your recommender system across GPUs and TPUs without tedious platform-specific tuning, thanks to a new cross-accelerator graph optimization framework.
Particle filtering reveals a fundamental limit to inference-time sampling methods for LLMs, suggesting that simply increasing the number of samples has diminishing returns.
Achieve state-of-the-art 4-bit LLM quantization accuracy with SERQ, a saliency-aware error reconstruction method that uses a single low-rank matrix, outperforming existing methods while reducing calibration complexity.
LLMs waste 21.8% of their context window on structural inefficiencies, but a demand paging system can slash context consumption by up to 93% without sacrificing performance.
By enabling draft models to "contemplate the future," ConFu achieves significant speedups in speculative decoding, outperforming EAGLE-3 by 8-11% on Llama-3 models.
Protein language models finally scale predictably: Reverse Distillation unlocks consistent gains by distilling large models into nested, Matryoshka-style embeddings guided by smaller, capacity-constrained models.
Achieve nearly 2x speedup in Stable Diffusion 3 by intelligently stitching together large and small diffusion models at both the pixel and timestep level.
Turn energy-intensive crypto mining into a data compression service with Proof-of-Encryption-Work (PoEW), a novel consensus mechanism.
Bridge the trust gap in cloud-based LLM services with AFTUNE, a practical framework that lets you audit proprietary fine-tuning and inference without prohibitive overhead.
Forget full fine-tuning: Low-rank adapters let you adapt speech enhancement models to new acoustic environments on-device, updating less than 1% of parameters for significant quality gains.
Diffusion language models have surprisingly redundant early layers, enabling nearly 20% FLOPs reduction at inference time via layer skipping without sacrificing performance.
Most output-level defenses against LLM knowledge distillation are surprisingly weak, failing to prevent knowledge theft even from naive attackers.
Squeeze 46% more LLM inference throughput from your many-core CPUs with ArcLight, a new architecture that overcomes the cross-NUMA memory access bottleneck.