Model compression, quantization, pruning, distillation, and efficient inference for deployment.
Fine-tuning efficient few-step diffusion models no longer requires sacrificing their speed, thanks to a self-distillation approach that preserves inference capabilities.
Learned image compression finally delivers on its promise: a codec that's not just perceptually superior, but also crushes traditional and learned alternatives in bitrate savings while running blazingly fast on mobile.
Stop training in isolation: LNTrust lets decentralized models learn *who* to trust during training, so they can collaborate effectively at deployment, boosting accuracy and cutting communication costs.
Self-distillation can be more effective than learning from an external teacher, but only if you optimize for preference gaps instead of blindly matching the teacher's output distribution.
Granular Mixture-of-Experts can now be efficient: AIR-MoE's two-stage routing slashes routing costs without sacrificing performance.
Approximate computing can break MoEs in unexpected ways, with dense networks sometimes proving more robust, but careful retraining can unlock surprising efficiency gains in specific architectures.
Suppressing weight outliers via a Hessian-informed additive transformation unlocks >40% perplexity reduction in 2-bit quantized LLMs compared to standard GPTQ.
LLM agents can now autonomously design complex hardware like an LLM inference accelerator with hard-wired TurboQuant support in just 80 hours.
Shuffling activations, a popular defense in secure Transformer inference, crumbles under a new alignment attack that recovers model weights for just $1.
Forget heuristics: this queueing theory framework precisely predicts LLM inference stability under KV cache constraints, letting you right-size your GPU cluster.
Forget full fine-tuning: QLoRA on 7B models can match the perplexity of fully fine-tuned smaller models for low-resource languages, while slashing the parameter count by 40x.
Forget backprop and memory lookups: FAAST lets you adapt models at test time with a single forward pass, matching fine-tuning accuracy with massive speed and memory gains.
UniVer achieves state-of-the-art speculative decoding by jointly optimizing multi-step and multi-draft verification, outperforming existing methods by up to 8.5% in acceptance length.
You can distill interpretable Bayesian reasoning about opponent preferences into an 8B language model, outperforming much larger models and enabling detailed auditability of negotiation strategies.
Forget token deletion: Telegraph English rewrites prompts into a symbol-rich, structured dialect that compresses by 50% while actually *improving* accuracy on smaller models.
Choosing between secure multi-party computation (SMPC) and fully homomorphic encryption (FHE) for secure ML depends heavily on the model architecture: FHE excels at regressions and simple networks, while SMPC dominates for complex CNNs.
Lattice-based cryptography's reliance on injected noise for security is more akin to hiding secrets under a rug than truly erasing them, leaving them vulnerable to future quantum attacks.
A clever routing strategy lets a tiny 3B code model outperform a massive 480B model on routine code completion tasks, slashing accelerator usage by 58%.
Freezing a text encoder and distilling prompts from vision-language models can stabilize semantics and boost performance in lifelong person re-identification, even across unseen domains.
Forget PEFT and KD: reprogramming distillation offers a surprisingly effective and robust way to adapt large medical foundation models to diverse downstream tasks.
MARL-optimized collaboration between large and small models in LEO satellites slashes service delays by nearly a third.
Generative recommenders can slash latency by up to 38% simply by dynamically juggling GPU memory between embedding and KV caches, a feat current systems miss.
Run billions of bitwise operations directly in your 3D NAND flash, error-free, using just standard instructions.
RangeGuard lets you tolerate 64+ flipped bits in DNN memory using just 16 bits of parity, without sacrificing accuracy.
On-device LLMs can now drive real-time recommendation improvements, unlocking faster adaptation to evolving user intent without cloud reliance.
Mamba's linear complexity meets perceptual image compression, yielding a lightweight model that rivals GANs and diffusion models in visual quality while being far more efficient.
Exploiting temporal continuity and feature deviations in wearable sensor data lets you adapt activity recognition models on the fly, boosting accuracy while slashing compute costs.
Resource-strapped edge devices can now achieve state-of-the-art face recognition across different sensing modalities thanks to a new lightweight CNN-Transformer architecture.
3D Gaussian Splatting gets a nearly 2x speed boost thanks to a clever bounding box strategy that drastically reduces unnecessary tile intersection checks.
Stop wasting compute on unreliable rollouts and easy frames: Stream-R1 adaptively focuses video diffusion distillation where it matters most, boosting quality without architectural changes or added inference cost.
Save up to 2.79x on LLM serving costs by intelligently distributing models across a diverse fleet of cloud GPUs.
Get 4x faster LLM inference with Budgeted LoRA, which smartly redistributes compute between dense and low-rank pathways during distillation, outperforming standard LoRA in both speed and function-style in-context learning.
Forget massive models: small, locally-deployable language models can achieve surprisingly strong performance on privacy-sensitive clinical information extraction tasks with self-prompting and preference-based optimization.
Stochastic sampling from p-bit Ising models can slash the search effort of CDCL SAT solvers by over 80% on certain problem instances.
A new cryptographic system promises top-level security for IoT gadgets without sacrificing performance, a rare win for resource-constrained devices.
Computation-in-memory combined with lightweight cryptography slashes energy consumption by up to 44% in steganography applications.
Standard federated learning deployments can catastrophically fail with just 5-second latency or 50% packet loss, revealing a fundamental mismatch between FL's communication patterns and default TCP configurations.
A binarized YOLOv3-tiny on a low-cost FPGA delivers object detection results nearly identical to the ONNX model's while drastically reducing computational cost.
Dramatically extend the battery life of bioacoustic sensors by embedding a highly accurate CNN classifier directly on a microcontroller, enabling selective recording of target species.
LLM serving can get a 34% boost in end-to-end SLO attainment by intelligently scheduling prefill and decode requests based on urgency and slack.
Pushing speculative decoding to new heights, SpecKV adaptively tunes speculation length based on draft model confidence, achieving a 56% speedup over fixed-length speculation, which is especially crucial for compressed models.
Commodity GPU servers can achieve surprisingly high LLM inference throughput by cleverly orchestrating pipeline parallelism with KV cache offloading.
Strong differential privacy can cause speech classifiers to collapse into near-useless single-class predictors, but a two-stage training process involving distillation can stabilize training.
Autoregressive video generation gets a 6x speed boost without sacrificing quality, thanks to a motion-aware caching strategy that finally respects the fact that not all pixels are created equal.
Attention bottlenecks in long-context decoding? SANTA slashes memory bandwidth demands by stochastically sampling value vectors, achieving 1.5x speedups without sacrificing accuracy.
Cut KV-cache transfer times by up to 32% with SplitZip, a new GPU-friendly lossless compressor that unlocks faster disaggregated LLM serving.
Signal processing practitioners gain a coherent roadmap for deploying sequential Gaussian Processes in real-world systems, bridging the gap between ML advances and practical application.
Token-aware clustering and hierarchical indexing can slash retrieval latency by an order of magnitude without sacrificing accuracy, making multivector retrieval practical at scale.
Forget chasing bigger GPUs: the future of AI inference could be literally baked into the hardware itself, unlocking 1000x gains in energy and speed.
Mapping the energy-latency frontier reveals how much cheaper and greener AI inference could be if we strategically relocate computation based on latency tolerance.