Search papers, labs, and topics across Lattice.
100 papers published across 9 labs.
Achieve both low-bitrate perceptual video compression and practical scalability with ProGVC, a framework that unifies progressive transmission, efficient entropy coding, and detail synthesis.
Pre-trained models unlock surprisingly aggressive quantization in federated learning, slashing communication costs by 40% without sacrificing accuracy on MNIST and CIFAR-100.
Achieve better compression in low-bit quantization by considering not just numerical sensitivity, but also the structural role of each layer.
LLMs can predict multiple tokens in parallel without any training, simply by cleverly probing their embedding space with dynamically generated mask tokens.
Pruning vision tokens across both the ViT and LLM can yield a 62% efficiency boost in video VLMs with minimal performance loss, and without complex text conditioning.
Forget scaling laws: dropout robustness in transformers is a lottery, with smaller models sometimes showing perfect stability while larger models crumble under stochastic inference.
Forget buying new GPUs – clever context-length routing can boost your LLM inference energy efficiency by 2.5x, dwarfing the 1.7x gain from upgrading to a B200.
Forget training separate models for each compression level; this framework achieves state-of-the-art extreme image compression with flexible bitrate control using a single diffusion-based arbitrary-scale super-resolution model.
Forget painstakingly tuning quantization for each LLM – RAMP learns a quantization policy that generalizes across architectures, often outperforming target-specific training.
Lossless compression can actually *speed up* LLM inference on GPUs, not just shrink model size, thanks to ZipServ's hardware-aware design.
Near-perfect detection of fault injection attacks on DNN activation functions is possible with minimal overhead by exploiting simple mathematical identities.
Ditch backprop's limitations: this synthesizable RTL implementation brings predictive coding networks to life in fully distributed hardware.
KANs get a 50x BitOps reduction without accuracy loss by quantizing their B-splines down to 2-3 bits and using lookup tables.
Running robotic manipulation workloads entirely onboard kills robot batteries, but offloading to the cloud tanks accuracy due to network latency, revealing a critical compute placement trade-off.
LLMs can be drastically compressed without retraining because the relative ordering of weights matters far more than their exact values, opening the door to efficient, training-free compression techniques.
Forget SVD: CARE aligns low-rank attention approximations with input activations, boosting accuracy by up to 1.7x and slashing perplexity by 215x when converting models to multi-head latent attention.
Ditch the separate anomaly detection model: your existing ML model already holds the keys to faster, better anomaly detection.
Quantizing large vision-language models just got a whole lot better: a new token-level sensitivity metric closes the accuracy gap with full-precision models by up to 1.6% in 3-bit weight-only quantization.
Achieve state-of-the-art anomaly detection in multi-class and continual learning scenarios with AdapTS, a teacher-student framework that slashes memory overhead by up to 149x compared to existing methods.
Even with environmental noise, a VAE-based anomaly detector can spot adversarial attacks on collaborative DNNs with high accuracy.
Achieve 100x radar data compression with only a 1% performance drop by adaptively pruning DCT coefficients based on detection confidence gradients.
LLMs can maintain performance while skipping global attention for 80% of tokens, slashing compute costs and memory footprint in long-context scenarios.
Robot control gets a whole lot faster: ProbeFlow slashes action decoding latency by 14.8x in Vision-Language-Action models, all without retraining.
Instance-specific timestep schedules can significantly boost diffusion model performance, challenging the reliance on global discretization strategies.
LLM serving systems can boost Time-To-First-Token (TTFT) attainment by up to 2.4x simply by prioritizing network flows based on a novel approximation of Least-Laxity-First scheduling.
Forget slow, single-SSD paging: Swarm unlocks 2.7x higher bandwidth for LLM KV-cache offloading by exploiting stable co-activation patterns to parallelize I/O across multiple SSDs.
Robots can think (and act) twice as fast: HeiSD's hybrid speculative decoding turbocharges embodied agents by intelligently switching between draft and retrieval strategies.
Forget full finetuning: OPERA's dynamic pruning lets you adapt retrieval models to new domains with better ranking and recall, in half the time.
Biased compression, previously overlooked in distributed learning with gradient coding, can actually boost performance when combined with error feedback to mitigate straggler effects and reduce communication costs.
Achieve personalized generation with cloud-scale reasoning while preserving user privacy, thanks to a novel asymmetric collaboration framework that's also 2x faster.
Forget perplexity – ZipCal uses Zipf's law to curate calibration data for LLM compression, matching state-of-the-art performance at 240x the speed.
Seemingly idle LLM inference fleets can be secretly broken, and this simulator helps you find out why before you buy.
Shrinking a leading 3D hand mesh reconstruction model by 65% yields a 1.5x speedup with minimal accuracy loss, unlocking real-time performance on resource-constrained devices.
Resource-consumption vulnerabilities in LLMs can degrade both service availability and economic sustainability, demanding a systematic approach to understanding and mitigating them.
LLM GPU fleets can be analytically optimized into a two-pool architecture with gateway-layer compression, slashing costs by up to 82% without sacrificing latency.
MXFP4 quantization just got a whole lot better: BATQuant recovers up to 96.43% of full-precision performance in LLMs and MLLMs, even under aggressive W4A4KV16 settings, by preventing outlier propagation across quantization blocks.
Edge offloading with vAccSOL slashes robot-side power consumption by up to 80% and boosts vision pipeline frame rates by up to 24x, extending the operational lifespan of battery-powered robots.
Control video super-resolution with a few keyframes: SparkVSR lets you guide the process and fix artifacts, unlike black-box VSR models.
Object detectors can be made significantly more robust to domain shifts by distilling knowledge from a teacher network trained on clean data to a student trained on downscaled and corrupted versions of the same data.
Forget brute-force inversion: this study reveals a simple rule for choosing the fastest matrix update method in streaming outlier detection, slashing computation time.
Quantizing optimizer states in LLM pre-training introduces "staleness," but strategically timed resets can recover lost performance and reduce memory footprint.
Overcome the quadratic attention bottleneck in vision-language models with Parallel-ICL, a method that achieves comparable performance to full-context learning while drastically reducing inference time.
Frozen LLMs can learn to remember things across conversations, even with limited resources, by training adapters to read and write to a continuous latent space memory bank.
You can now run anomaly detection at 20 FPS with 94% AUROC on a Sony IMX500 sensor, thanks to an 8.7x parameter reduction in a new TinyGLASS architecture.
Stop wasting precious GPU memory: this new cache-semantic hash table library achieves up to 3.9 billion key-value lookups per second, outperforming standard approaches by up to 9.4x.
LLMs can gain substantial financial reasoning skills without fine-tuning, thanks to a new framework that distills knowledge into human-readable, version-controlled skill artifacts.
Shrinking LLM reasoning for mobile devices is now possible: LoRA adapters, RL-based budget forcing, and KV-cache tricks let Qwen2.5-7B reason efficiently on-device.
A simple orthogonal rotation of the activation space makes LLMs virtually immune to bit-flip attacks, even against targeted single-point faults.
Forget blindly pruning LLMs: this work shows you can use Sparse Autoencoders to identify and protect the most functionally important components during compression, leading to more robust models.
Blindly applying GPU optimizations to homomorphic encryption can leave nearly 2x performance on the table, as the best strategy hinges on CKKS parameters and GPU architecture.
Elastic-Sketch's performance hinges on stream characteristics and eviction thresholds, but this work cracks the code to near-optimal configuration by deriving closed-form expressions for its limiting behavior under stationary random streams.
Squeeze your LLM's KV cache by 82% without significant performance loss using VQKV's novel vector quantization approach.
Achieve diffusion-level perceptual quality in monocular depth estimation at 40x the speed, by replacing the slow initial diffusion steps with a fast ViT-based depth map and refining in a compact latent space.
SNNs can be pruned to extreme sparsity without sacrificing accuracy by explicitly controlling temporal distortion across layers and timesteps.
Binary neural networks can now be trained effectively in federated settings, offering a path to low-cost, privacy-preserving edge inference without sacrificing accuracy.
Achieve efficient and positionally consistent simultaneous machine translation with LLMs, regardless of the positional encoding method, using a surprisingly simple explicit position allocation strategy.
Inference time can reveal the GPU models behind black-box LLM APIs, offering a way to estimate their hidden energy costs.
TinyML for agriculture is trending towards localized inference on microcontrollers, but inconsistent resource reporting is slowing down real-world deployment.
Sparsity, often viewed as a means for efficiency, actually unlocks deeper, more effective LLMs by taming variance and boosting layer utilization.
Achieve real-time object detection on resource-constrained AR/VR devices by ditching compute-heavy operations for memory lookups inspired by human vision.
Event cameras get a major efficiency boost: EECVS achieves 2.7x higher throughput and superior generalization in downstream tasks by adaptively compressing event streams using tailored transforms.
Quantizing neural networks doesn't have to mean sacrificing robustness: a new three-stage framework achieves up to 10.35% better attack resilience and 12.47% better fault resilience.
Forget complex combinators: a simple multiplication trick can slash LLM latency by 92% and boost throughput by 21%, outperforming production schedulers.
Machine translation can now safeguard sensitive information during inference thanks to a new task, benchmark datasets, and metrics designed to protect named entities.
Mamba-3 delivers a 1.8 point accuracy boost over competing models in downstream language tasks, proving that SSM-inspired techniques can unlock substantial performance gains without sacrificing inference efficiency.
Squeeze 2x more speed from your conditional flow matching models by optimizing data-noise coupling across minibatches.
Achieve real-time full-body human mesh recovery from a single RGB stream with Fast SAM 3D Body, a 10x speedup over the original without sacrificing accuracy.
For spacecraft-bound neural networks, a new bit-serial matrix multiplication accelerator, bitSMM, delivers impressive GOPS/W on both FPGA and ASIC, promising efficient on-board inference.
Achieve near-ideal GPU sharing without kernel hacks: DetShare guarantees semantic and performance determinism through GPU coroutines and lightweight context migration.
xLSTM models can now effectively learn from large attention-based models, even outperforming their teachers on some tasks through a novel distillation and merging pipeline.
Textual pathways in LVLMs are more sensitive to pruning than visual pathways, implying that you can aggressively prune visual inputs without significantly impacting performance.
Cuckoo filters on GPUs can now achieve performance rivaling append-only Bloom filters, thanks to a novel lock-free architecture and memory access optimization strategy that closes the gap between static and dynamic approximate membership query structures.
Multi-agent LLM systems can slash synchronization costs by up to 95% by borrowing cache coherence strategies from chip design.
LLMs can run up to 35% faster on chiplet architectures thanks to a new lossless exponent compression technique that slashes inter-chiplet communication overhead.
Exact sampling in large-vocabulary decoding can be sped up by 19% simply by fusing it into the LM-head matmul, turning a bandwidth bottleneck into a lightweight epilogue.
TabKD achieves state-of-the-art data-free knowledge distillation for tabular data by generating synthetic data that maximizes interaction diversity, a critical factor previously overlooked.
Turns out, blindly widening the beam search in your LLM can actually *hurt* performance due to overestimation bias, and the optimal width depends critically on your scorer's signal-to-noise ratio.
Forget exotic attention mechanisms – MobileLLM-Flash achieves up to 1.8x faster LLM prefill on mobile CPUs by smartly pruning and adapting existing architectures for on-device use.
Get quantitative safety guarantees with adjustable confidence levels for compressed neural networks, even after aggressive quantization and pruning.
Achieve >19x compression on high-resolution drone imagery without sacrificing object detection performance by intelligently allocating bitrates with a PPO-trained agent guiding a conditional diffusion model.
By selectively attending to question-relevant information across video frames and memory, QViC-MF achieves state-of-the-art results in long-term video understanding, highlighting the importance of feedback-driven perception.
Forget painstakingly tuning RL in the real world – SimDist lets you pre-train a world model in simulation and then rapidly adapt it via supervised learning, slashing data requirements and boosting performance.
Squeezing federated learning through bandwidth-constrained networks? This routing and pruning method boosts accuracy by 12% while slashing latency by 28%.
LLMs can solve math problems more efficiently by "thinking" silently in their latent space, adaptively refining their reasoning process only as much as needed, and slashing token usage by over 90%.
SALT offers a surprisingly effective way to personalize and harden split computing models in closed environments, using a lightweight adapter that outperforms full fine-tuning while slashing training costs.
Forget training from scratch: PrototypeNAS finds deployable MCU-optimized DNNs in minutes using zero-shot proxies and smart search space design.
MDLMs can be significantly improved *without* retraining by using attention weights to guide sampling based on inter-token dependencies.
Document parsing just got a whole lot faster: a simple plug-in method boosts VLM decoding speed by up to 2.2x while also reducing hallucinations.
IRIS achieves real-time rendering and editing of neural scenes by analytically computing ray intersections and aggregating features along the ray, sidestepping slow volumetric sampling and spatial lookups.
Achieve up to 12.63% performance gains on fine-grained visual categorization by adaptively distilling knowledge from VLMs to lightweight classifiers using a task-aligned intermediate teacher.
Text-based speculative decoding falls flat for vision-language models, but ViSkip dynamically adapts to vision tokens for state-of-the-art acceleration.
Ditching the 2D latent grid unlocks 60%+ bitrate reductions in generative video compression by encoding videos into adaptable 1D latent tokens.
Achieve 50% bitrate savings in ultra-low-bitrate image compression by cleverly turning image decoding into a next-frame prediction problem using video diffusion priors.
FPGAs can beat GPUs at dynamically allocating computation for LLM inference, thanks to a new architecture that fuses operations, uses mixed precision, and caches KV values on-chip.
Hybrid Mamba-Transformer models can get 4x faster time to first token and 1.4x higher throughput by disaggregating prefill and decode phases onto specialized accelerator packages.
Unified multimodal models secretly contain separate inference pathways for generation and understanding, and FlashU unlocks this hidden potential for 2x speedup without retraining.
Stop wasting compute: Sharing KV caches across tasks and time can make Vision-Language-Action models run 3.7x faster.
Achieve up to 1.75x faster language model inference by swapping the standard classification head with FlashHead, a training-free retrieval-based alternative.
CacheLib, a popular caching engine, buckles under dynamic multi-tenant workloads, revealing critical limitations in adaptability and fairness that demand a rethink of its design.
Achieve 330x energy reduction in spiking neural networks by adaptively exiting computation based on input complexity using reinforcement learning.