Search papers, labs, and topics across Lattice.
54 papers published across 6 labs.
Cut inference verification costs by 1000x with a sampling-based cryptographic approach that catches adversarial attacks on Llama-2-7B in milliseconds.
Foundation models trained on audio, general time series, and brain signals can be distilled into a single, powerful encoder for scientific time series, unlocking performance gains on par with task-specific training.
TVLA misses subtle side-channel leakage in neural networks, but a new statistical test closes the gap.
Diffusion language models can achieve up to 26x inference speedups with almost no accuracy loss, thanks to a clever entropy-based KV caching strategy that avoids costly full forward passes.
LLMs can maintain generation quality in long-context scenarios while using significantly less context, simply by adaptively allocating context based on uncertainty.
Securing legacy industrial protocols with modern encryption like ChaCha20-Poly1305 is far more practical than previously thought, adding single-digit percentage overhead to latency-sensitive applications.
Accurately simulate LLM inference power consumption at scale, from individual GPUs to entire datacenters, with a framework that learns from real-world traces and generalizes to unseen configurations.
Forget massive SRAMs: this work shows that clever data streaming and compute/transfer overlap can yield 22x speedups for transformer inference, even with standard PCIe interconnects.
Get continuous level-of-detail rendering in 3D Gaussian Splatting without sacrificing top-end quality, with no architectural changes needed.
LLM agents can slash task completion time by almost 50% simply by predicting and pre-executing likely tool calls.
Achieve significant reasoning gains in frozen LLMs (+22.4%) without retraining by adaptively routing reward model guidance at the token level during inference.
Forget fixed decoding strategies: RL can learn a lightweight policy to adapt LLM sampling *at test time*, boosting summarization quality by up to 88% without retraining the LLM.
Confidential databases can be 78x faster by ditching crypto in the query path.
Unlocking new high-probability differentials in SIMON32 cracks open avenues for more efficient cryptanalysis, pushing past current state-of-the-art round limits.
Text-to-image synthesis just got almost 4x faster without sacrificing image quality, thanks to a clever twist on Speculative Jacobi Decoding that keeps the generation process moving even when initial drafts are rejected.
Compact ViTs can now rival or surpass CNN-based architectures like YOLO for edge-based object detection, instance segmentation, and pose estimation, thanks to task-specialized distillation.
Ditch the fine-tuning: SVOO achieves up to 1.93x speedup in video generation with sparse attention by exploiting the intrinsic, layer-specific sparsity patterns of attention, with no training required.
Achieve nearly 3x faster LLM inference by intelligently splitting the workload between edge devices and the cloud, without any training.
Edge devices can now run MoEs in real-time thanks to a dynamic quantization scheme that prioritizes important experts and critical layers.
Discrete diffusion models can now generate more diverse text without sacrificing quality, thanks to a new decoding method that explicitly optimizes for diversity during beam search.
Token compression and multi-agent systems are enabling more efficient and interpretable multimodal reasoning in computational pathology, paving the way for trustworthy AI-assisted diagnosis.
Flow-based VLAs can react to environmental changes ten times faster by adaptively prioritizing near-term actions during sampling, unlocking unprecedented real-time responsiveness.
Training speculative decoding models just got an order of magnitude faster, unlocking real-world deployment with a new open-source framework and a suite of production-ready draft models.
LLM endpoints can appear "healthy" according to traditional metrics while undergoing subtle behavioral shifts detectable by monitoring output distributions, highlighting a critical gap in current reliability practices.
LLM watermarks can now survive fine-tuning, quantization, and distillation thanks to a new method that embeds them in a stable functional subspace.
Dramatically speed up histopathology super-resolution by adaptively routing image tiles through a flow-matching network, achieving near-lossless quality at a fraction of the compute.
Video diffusion models can be aggressively quantized down to 6-bit precision with minimal quality loss by dynamically adapting the bit-width of each layer based on its temporal stability.
Achieve both low-bitrate perceptual video compression and practical scalability with ProGVC, a framework that unifies progressive transmission, efficient entropy coding, and detail synthesis.
Pre-trained models unlock surprisingly aggressive quantization in federated learning, slashing communication costs by 40% without sacrificing accuracy on MNIST and CIFAR-100.
Achieve better compression in low-bit quantization by considering not just numerical sensitivity, but also the structural role of each layer.
LLMs can predict multiple tokens in parallel without any training, simply by cleverly probing their embedding space with dynamically generated mask tokens.
Pruning vision tokens across both the ViT and LLM can yield a 62% efficiency boost in video VLMs with minimal performance loss, and without complex text conditioning.
Forget scaling laws: dropout robustness in transformers is a lottery, with smaller models sometimes showing perfect stability while larger models crumble under stochastic inference.
Forget buying new GPUs: clever context-length routing can boost your LLM inference energy efficiency by 2.5x, dwarfing the 1.7x gain from upgrading to a B200.
Forget training separate models for each compression level; this framework achieves state-of-the-art extreme image compression with flexible bitrate control using a single diffusion-based arbitrary-scale super-resolution model.
Forget painstakingly tuning quantization for each LLM: RAMP learns a quantization policy that generalizes across architectures, often outperforming target-specific training.
Lossless compression can actually *speed up* LLM inference on GPUs, not just shrink model size, thanks to ZipServ's hardware-aware design.
Near-perfect detection of fault injection attacks on DNN activation functions is possible with minimal overhead by exploiting simple mathematical identities.
Ditch backprop's limitations: this synthesizable RTL implementation brings predictive coding networks to life in fully distributed hardware.
KANs get a 50x BitOps reduction without accuracy loss by quantizing their B-splines down to 2-3 bits and using lookup tables.
Running robotic manipulation workloads entirely onboard kills robot batteries, but offloading to the cloud tanks accuracy due to network latency, revealing a critical compute placement trade-off.
LLMs can be drastically compressed without retraining because the relative ordering of weights matters far more than their exact values, opening the door to efficient, training-free compression techniques.
Forget SVD: CARE aligns low-rank attention approximations with input activations, boosting accuracy up to 1.7x and slashing perplexity by 215x when converting models to multi-head latent attention.
Ditch the separate anomaly detection model: your existing ML model already holds the keys to faster, better anomaly detection.
Quantizing large vision-language models just got a whole lot better: a new token-level sensitivity metric closes the accuracy gap with full-precision models by up to 1.6% in 3-bit weight-only quantization.
Achieve state-of-the-art anomaly detection in multi-class and continual learning scenarios with AdapTS, a teacher-student framework that slashes memory overhead by up to 149x compared to existing methods.
Even with environmental noise, a VAE-based anomaly detector can spot adversarial attacks on collaborative DNNs with high accuracy.
Achieve 100x radar data compression with only a 1% performance drop by adaptively pruning DCT coefficients based on detection confidence gradients.
LLMs can maintain performance while skipping global attention for 80% of tokens, slashing compute costs and memory footprint in long-context scenarios.
Robot control gets a whole lot faster: ProbeFlow slashes action decoding latency by 14.8x in Vision-Language-Action models, all without retraining.
Instance-specific timestep schedules can significantly boost diffusion model performance, challenging the reliance on global discretization strategies.
LLM serving systems can boost Time-To-First-Token (TTFT) attainment by up to 2.4x simply by prioritizing network flows based on a novel approximation of Least-Laxity-First scheduling.
Forget slow, single-SSD paging: Swarm unlocks 2.7x higher bandwidth for LLM KV-cache offloading by exploiting stable co-activation patterns to parallelize I/O across multiple SSDs.
Robots can think (and act) twice as fast: HeiSD's hybrid speculative decoding turbocharges embodied agents by intelligently switching between draft and retrieval strategies.