100 papers published across 4 labs.
LLMs can reason more efficiently by triaging queries and applying deep thought only when truly needed, thanks to a new coarse-to-fine inference framework.
By intelligently suppressing boundary outliers before quantization, BS-KMQ slashes quantization error by 3x and boosts energy efficiency by 24x in in-memory computing.
Exploit the surprisingly stable, yet heterogeneous, sparsity patterns across attention heads to slash LLM attention latency by 2.88x without sacrificing quality.
Iteratively refining target speaker extraction *without* retraining a model unlocks significant performance gains, offering a flexible and efficient approach to speech separation.
Achieve up to 1.28x faster VLA model inference for robotic manipulation without retraining, simply by merging visual tokens based on depth.
Stop wasting compute on easy and impossible examples: PACED distillation focuses your student model's training on the sweet spot where it actually learns.
Forget slow FP64: this work unlocks efficient double-precision matrix multiplication on modern GPUs by adapting the Ozaki-II scheme to run on faster FP8 hardware.
LLM-based ASR can be sped up by 4.4x with minimal accuracy loss by using a CTC encoder to speculatively generate draft transcriptions.
Diffusion Transformers can be accelerated by up to 7x with nearly lossless performance using a training-free method that selectively computes on sparse anchor tokens, outperforming existing temporal acceleration techniques.
Achieve up to 12x greater sample efficiency in reasoning tasks by relaxing strict imitation constraints in on-policy distillation, enabling smaller models to match the performance of much larger ones.
Forget subjective human evaluations: this paper uses a clever knowledge distillation trick to objectively rank XAI methods for NMT, revealing that attention-based attributions beat gradient-based ones.
Subtracting the mean from activations unlocks stable FP4 training for LLMs, closing the performance gap with BF16 without complex spectral methods.
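A minimal sketch of the centering idea in isolation, assuming signed int4 as a stand-in for a true FP4 format; the `quantize_centered_4bit` helper is hypothetical and not the paper's code:

```python
import numpy as np

def quantize_centered_4bit(x: np.ndarray):
    """Hypothetical sketch: subtract the mean, then map onto a signed 4-bit
    integer grid (a stand-in for a real FP4 format)."""
    mu = x.mean()
    centered = x - mu                                  # remove the shared offset
    scale = max(np.abs(centered).max() / 7.0, 1e-8)    # spend the grid on the spread
    q = np.clip(np.round(centered / scale), -8, 7).astype(np.int8)
    return q, scale, mu

def dequantize(q, scale, mu):
    return q.astype(np.float32) * scale + mu

x = np.random.randn(4, 8).astype(np.float32) + 3.0     # activations with a large mean
q, s, mu = quantize_centered_4bit(x)
print("max abs error:", np.abs(dequantize(q, s, mu) - x).max())
```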
Maximize your LLM's goodput without diving into its internals: a new black-box controller uses hill climbing on end-to-end measurements to optimize performance.
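As a generic illustration of black-box hill climbing on an end-to-end metric; the knobs, step rule, and `measure_goodput` hook below are hypothetical stand-ins, not the controller described in the paper:

```python
import random

def hill_climb(measure_goodput, knobs, steps=50):
    """Black-box tuning loop: perturb one knob at a time and keep the change
    only if the end-to-end measurement improves."""
    best = dict(knobs)
    best_score = measure_goodput(best)
    for _ in range(steps):
        name = random.choice(list(best))
        candidate = dict(best)
        candidate[name] = max(1, candidate[name] + random.choice([-1, 1]))
        score = measure_goodput(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Toy stand-in for a real serving stack: goodput peaks at batch=32, chunk=8.
toy_goodput = lambda k: -(k["batch"] - 32) ** 2 - (k["chunk"] - 8) ** 2
print(hill_climb(toy_goodput, {"batch": 4, "chunk": 1}, steps=500))
```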
Get faster long-context LLM inference without sacrificing accuracy: LookaheadKV predicts KV cache importance, outperforming costly draft generation methods by 14.5x.
Multi-robot systems can slash battery consumption by 15% and boost GPU utilization by 50% for large DNN inference by using a hybrid offline-online reinforcement learning strategy to dynamically schedule and distribute DNN module execution.
Accuracy leaderboards mislead: lightweight classical anomaly detectors surprisingly outperform deep methods when deployed under the throughput constraints of in-vehicle monitoring systems.
Secure multi-tenant LLM serving without sacrificing performance is now possible: CacheSolidarity selectively isolates prefixes, boosting cache reuse by up to 70% and cutting inference latency by 30% compared to blunt-force defenses.
Quantifying the overhead of post-quantum cryptography reveals exactly where the performance bottlenecks lie in real-world TLS 1.3 transactions.
Encoder-only multi-talker ASR can now rival LLM-based systems in accuracy while drastically reducing computational cost, thanks to a novel distillation approach and talker-count routing.
Stop neural network model theft: bind your models to specific hardware using PUFs, rendering them useless on clones.
Monocular depth estimation can now run at 161 FPS on edge devices without sacrificing too much accuracy, thanks to a clever asynchronous architecture that reuses features from a foundation model.
A pipelined FPGA architecture slashes the power consumption of JPEG XS's Intra Pattern Copy displacement vector search, enabling practical hardware deployment for low-latency image compression.
Ditch the slow diffusion grind: Marigold-SSD delivers zero-shot depth completion in a single step, rivaling discriminative models in speed while retaining diffusion's accuracy.
Vision-language models can significantly enhance language models through knowledge distillation, even without direct textual understanding, challenging conventional KD paradigms.
AgentServe achieves up to 2.8x improvement in time-to-first-token and 2.7x in time-per-output-token for agentic workloads on a single GPU by strategically isolating prefills and decodes.
Forget fixed decoding parameters: this RL-trained adapter dynamically adjusts LLM sampling strategies at inference, boosting accuracy by up to 10% under tight compute budgets.
Humanoid robots can now walk robustly in the real world using only onboard sensors, thanks to a new diffusion policy that implicitly learns state estimation.
Unlock calibrated uncertainty in Mixture-of-Experts Transformers with VMoER, a Bayesian routing method that slashes calibration error by 94% while barely impacting FLOPs.
DendroNNs offer a 4x energy efficiency boost over existing neuromorphic hardware by mimicking dendritic computation and training via a gradient-free rewiring mechanism.
On-device LLM inference can be sped up by an order of magnitude with a flexible TrustZone-based system that selectively protects memory and the NPU.
A Goldilocks zone exists for neural audio codec quantization depth, where intermediate levels strike the best balance between suppressing adversarial noise and preserving speech content for robust ASR.
ZipPIR delivers SimplePIR-level throughput without the massive client-side storage, finally making high-performance private information retrieval practical for resource-constrained devices.
On-device LLM inference with PIM is now more practical: PIM-SHERPA resolves memory inconsistencies, slashing memory capacity needs by ~50% without sacrificing performance.
By strategically increasing hash collisions, Nemo slashes write amplification in flash caches for tiny objects, a persistent bottleneck even with advanced SSDs.
BinaryAttention proves you can more than halve the runtime of attention in vision and diffusion transformers without sacrificing accuracy, simply by using the sign of queries and keys.
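A rough sketch of the headline trick, attention scores from sign(Q) @ sign(K)^T; real kernels would pack the signs into bits and use popcount, so treat this purely as an illustration:

```python
import numpy as np

def sign_attention(Q, K, V):
    """Attention scores from sign(Q) @ sign(K)^T instead of full-precision QK^T.
    NumPy for clarity; not an optimized or official implementation."""
    d = Q.shape[-1]
    scores = np.sign(Q) @ np.sign(K).T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)        # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q, K, V = (np.random.randn(16, 64) for _ in range(3))
print(sign_attention(Q, K, V).shape)   # (16, 64)
```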
Stream 3D Gaussian Splatting scenes with higher visual quality and lower bandwidth by predicting user viewpoints and dynamically adapting bitrate using deep reinforcement learning.
Achieve up to 7.24% code-size reduction by identifying and extracting idempotent backward slices, enabling the merging of non-contiguous instruction sequences within and across functions.
Achieve near-FP32 image restoration performance with an Int8 model that runs at 442 FPS on NVIDIA Jetson Orin, all thanks to a quantization-aware distillation framework that avoids decoder distillation.
Event cameras can now estimate depth with significantly improved temporal consistency and accuracy thanks to a novel distillation approach from video foundation models, achieving a 53% reduction in depth error.
By recombining subgraphs from sparse models without retraining, "model stitching" creates a diverse set of model variants that significantly improves the efficiency of multi-DNN inference on edge SoCs.
Finally, analog joint source-channel coding can be deployed on standard digital transceivers, unlocking the potential of semantic communication on existing infrastructure.
Get up to 24x faster sine/cosine calculations on ESP32 microcontrollers by dynamically switching between fixed-point and floating-point precision.
IoT devices struggling with weak entropy can now get a cryptographic boost from a RISC-V trusted execution environment, turning entropy provisioning into a manageable service.
Achieve RAG efficiency without sacrificing accuracy: LooComp prunes context by identifying and retaining only the most critical sentences for answering a query.
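A toy illustration of sentence-level context pruning; the word-overlap scorer and the `prune_context` helper are placeholders, not LooComp's actual criterion:

```python
import re
from collections import Counter

def prune_context(query, context, keep=2):
    """Keep the `keep` sentences that overlap most with the query.
    The overlap scorer is a stand-in; a real system would use a trained relevance model."""
    q_words = Counter(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    ranked = sorted(sentences,
                    key=lambda s: sum(q_words[w] for w in re.findall(r"\w+", s.lower())),
                    reverse=True)
    kept = set(ranked[:keep])
    return " ".join(s for s in sentences if s in kept)   # preserve original order

print(prune_context("Who designed the bridge?",
                    "The bridge opened in 1932. It was designed by John Roebling. "
                    "Tolls were abolished in 1970."))
```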
Achieve 45x compression of 3D Gaussian Splatting data while *improving* visual fidelity by over 10% with a streaming-friendly octree-based codec.
Achieve higher accuracy and faster convergence in split learning by intelligently pruning communication channels based on label awareness.
Achieve comparable speech restoration quality with conditional diffusion models using 10x fewer neural network evaluations via a novel iSDE solver.
Ditch slow, iterative ODE solvers for robot control: this method distills flow-based policies into a single-step model that's fast enough for real-time replanning without sacrificing multi-modal action diversity.
Forget ensembling or retraining: model merging lets you Frankenstein LLMs for specialized skills at minimal cost.
Forget confidence scores: a modality-aware early exit strategy for spoken language models slashes decoding costs without sacrificing accuracy or perceptual quality, revealing that speech tokens require specialized handling compared to text.
Achieve up to two orders of magnitude reduction in semantic communication rate by strategically incorporating common randomness in a privacy-preserving distributed computation framework.
Token pruning in dense retrieval gets a geometric upgrade: Voronoi cells offer a principled way to shrink your index without sacrificing search quality.
Achieve a 277x speedup in autoregressive video generation by distilling diffusion models with a novel "diagonal distillation" approach that leverages temporal context and mitigates error propagation.
Don't fully retrain your draft model after fine-tuning your LLM: EDA restores speculative decoding performance with significantly less compute by adapting only a small, private component and regenerating training data.
Multi-prototype-guided federated learning overcomes data heterogeneity in edge computing, boosting accuracy and reducing errors compared to single-prototype methods.
VLMs can achieve 7.8x faster prefilling speeds with only a minor accuracy drop by intelligently pruning redundant visual tokens *without* retraining.
On-device fine-tuning of Transformers is now feasible on ultra-low-power, memory-constrained edge devices thanks to TrainDeeploy, which achieves up to 11 trained images per second on a RISC-V SoC.
Mamba-2's efficiency doesn't require custom CUDA kernels: XLA's compiler optimizations are enough to unlock near-optimal performance across diverse hardware.
K-means, previously relegated to offline processing, gets a 17.9x speed boost on modern GPUs thanks to Flash-KMeans' clever IO and contention optimizations.
MDLMs can be sped up by nearly 10x without retraining, simply by focusing computation on the tokens that actually change between denoising steps.
Caching and speculative transcoding can drastically reduce the computational burden of on-the-fly point cloud transcoding, enabling scalable streaming systems.
Text-to-audio diffusion just got a whole lot faster: SoundWeaver slashes latency by up to 3x without retraining, simply by cleverly reusing similar audio samples.
Squeezing 11x more performance from your datacenter GPUs is now possible for compound inference tasks, thanks to JigsawServe's adaptive model selection and fine-grained spatial partitioning.
By framing prior smoothing as a shrinkage process and applying a micro-diffusion denoising layer, Midicoth achieves more accurate probability estimates in lossless compression, even with limited data.
Ditch the stochasticity: Deterministic pruning slashes LLM size with minimal performance loss, outperforming stochastic methods and accelerating inference.
One-step image synthesis can be dramatically improved by focusing on weight *direction* changes during distillation, not just magnitude.
Constraints don't just limit optimization; they warp the very geometry of improvement, revealing hidden ascent directions.
Recovering types from stripped binaries just got a whole lot faster: XTRIDE achieves up to 2300x speedup in struct recovery while maintaining state-of-the-art accuracy.
Slash blockchain bloat by an order of magnitude: AR-ACE ships compact attestations, not bulky validity proofs, through mempool and relay networks.
Slash blockchain transaction sizes by an order of magnitude with ZK-ACE, which replaces bulky post-quantum signatures with succinct, identity-based zero-knowledge proofs.
MoE models, despite their training efficiency, can be structurally 4.5x slower than quality-matched dense models at inference due to memory fragmentation, especially in long-context scenarios.
Language models can beat FLAC for lossless audio compression at 8-bit and 16-bit, but their advantage shrinks at 24-bit, revealing a challenge for high-fidelity audio.
Get 3.6x faster long-context LLM inference with LycheeCluster's hierarchical KV indexing, which avoids the semantic fragmentation of naive chunking.
Overcome memory bottlenecks in drone-based Synthetic Aperture Radar (SAR) imaging with a new online reconstruction method that processes data incrementally.
VLA models get a 1.73x speedup with only 5-7% overhead thanks to RAPID, a new edge-cloud collaborative inference framework that smartly handles visual noise and motion continuity.
Stop wasting compute: CODA dynamically adjusts reasoning depth based on problem difficulty, slashing token costs by 60% on easy tasks while boosting performance on hard ones.
Forget token counting: this work introduces a semantic prior based on surprisal to compress LLM reasoning traces, achieving better accuracy and fluency than heuristic length penalties.
Speech models can now be quantized to INT4 with near-lossless performance thanks to a new evolution strategy-based calibration method tailored for audio activations.
LLMs can be pruned more effectively by considering the information entropy of their output distribution, surpassing the limitations of traditional cross-entropy-based Taylor pruning.
Ditch 25% of your Transformer's attention parameters without sacrificing performance by swapping the dense output projection for a structured Hadamard transform, and watch your throughput climb.
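For intuition, a structured Hadamard transform replaces an O(d^2) dense projection with O(d log d) butterflies; the `fwht` sketch below shows only the transform itself, under the assumption of a power-of-two width, not the paper's integration into attention:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform along the last axis (length must be a power
    of two): O(d log d) butterflies instead of an O(d^2) dense matmul."""
    x = x.copy()
    d = x.shape[-1]
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(d)                  # orthonormal scaling

# Illustrative swap: apply the transform where a dense output projection would sit.
attn_out = np.random.randn(32, 128)        # (tokens, model_dim), power-of-two width
print(fwht(attn_out).shape)                # (32, 128), no learned W_O parameters
```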
Achieve near lossless 40% parameter and FLOPs reduction in large vision transformers like CLIP and DINOv2 without finetuning, thanks to adaptive MLP pruning.
Tree speculative decoding can achieve up to 2.46x speedup on Ascend NPUs, but only if you carefully manage the branch/commit cache and eliminate undefined negative indices.
Forget uncontrolled parameter growth in class incremental learning: GRACE adaptively scales model capacity, achieving state-of-the-art performance with a 73% memory reduction.
LLMs can slash inference costs by 80% without sacrificing accuracy, simply by learning to recognize when their own reasoning is shaky and needs a second opinion.
Credal sets, previously impractical for large models, are now efficiently computable via a "decalibration" method that delivers strong performance in uncertainty-aware tasks.
Squeeze your embodied AI models: DyQ-VLA cuts memory footprint by 70% and speeds up inference by 40% without sacrificing performance, all by dynamically adjusting bit-widths based on real-time kinematic data.
Beat the LLM inference bottleneck: SageSched's uncertainty-aware scheduling boosts efficiency by nearly 30% by predicting output length and balancing compute and memory demands.
Get near-peak performance for your recommender system across GPUs and TPUs without tedious platform-specific tuning, thanks to a new cross-accelerator graph optimization framework.
Particle filtering reveals a fundamental limit to inference-time sampling methods for LLMs, suggesting that simply increasing the number of samples has diminishing returns.
Achieve state-of-the-art 4-bit LLM quantization accuracy with SERQ, a saliency-aware error reconstruction method that uses a single low-rank matrix, outperforming existing methods while reducing calibration complexity.
LLMs waste 21.8% of their context window on structural inefficiencies, but a demand paging system can slash context consumption by up to 93% without sacrificing performance.
By enabling draft models to "contemplate the future," ConFu achieves significant speedups in speculative decoding, outperforming EAGLE-3 by 8-11% on Llama-3 models.
Protein language models finally scale predictably: Reverse Distillation unlocks consistent gains by distilling large models into nested, Matryoshka-style embeddings guided by smaller, capacity-constrained models.
Achieve nearly 2x speedup in Stable Diffusion 3 by intelligently stitching together large and small diffusion models at both the pixel and timestep level.
Turn energy-intensive crypto mining into a data compression service with Proof-of-Encryption-Work (PoEW), a novel consensus mechanism.
Bridge the trust gap in cloud-based LLM services with AFTUNE, a practical framework that lets you audit proprietary fine-tuning and inference without prohibitive overhead.
Forget full fine-tuning: Low-rank adapters let you adapt speech enhancement models to new acoustic environments on-device, updating less than 1% of parameters for significant quality gains.
Diffusion language models have surprisingly redundant early layers, enabling nearly 20% FLOPs reduction at inference time via layer skipping without sacrificing performance.
Most output-level defenses against LLM knowledge distillation are surprisingly weak, failing to prevent knowledge theft even from naive attackers.
Squeeze 46% more LLM inference throughput from your many-core CPUs with ArcLight, a new architecture that overcomes the cross-NUMA memory access bottleneck.