53 papers published across 2 labs.
Cut your 3D-QA model's token budget by 91% and latency by 86% with a new pruning method that intelligently balances semantic importance and geometric coverage.
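A minimal sketch of the kind of pruning this teaser describes — greedily keeping tokens that score high on a weighted mix of semantic importance and geometric (farthest-point) coverage. The scores, the `lam` trade-off weight, and the greedy loop are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, keep = 200, 16, 18                     # keep 18 of 200 tokens (~91% pruned)

feats = rng.standard_normal((n, d))          # token features (unused here, for context)
xyz = rng.uniform(0.0, 1.0, (n, 3))          # 3D position of each token
importance = rng.uniform(size=n)             # stand-in for attention-derived scores

lam = 0.5                                    # semantic vs. coverage trade-off
selected = [int(importance.argmax())]
min_dist = np.linalg.norm(xyz - xyz[selected[0]], axis=1)

while len(selected) < keep:
    # Greedy: favour important tokens that are far from everything kept so far.
    score = lam * importance + (1 - lam) * min_dist / min_dist.max()
    score[selected] = -np.inf                # never re-pick a kept token
    nxt = int(score.argmax())
    selected.append(nxt)
    min_dist = np.minimum(min_dist, np.linalg.norm(xyz - xyz[nxt], axis=1))

assert len(set(selected)) == keep
```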
Tucker Attention squeezes an order of magnitude more parameter efficiency out of attention layers, while unifying and simplifying Group Query Attention, Multi-Head Latent Attention, and standard Multi-Head Attention.
Achieve superior compression of wind turbine images without sacrificing defect detection accuracy by using a segmentation-guided, dual lossy/lossless compression scheme.
LLMs' skewed matrix shapes need not hamstring systolic array performance: SISA's partitioned architecture achieves up to 8.52x speedup and 93% EDP reduction compared to monolithic arrays.
Radically simpler train loading plans are now possible by implicitly modeling rehandle costs, slashing the complexity of optimization problems.
Run multiple LoRA-tuned GenAI models on your phone without blowing up storage or latency: just swap weights at runtime.
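The trick here is that a LoRA adapter is just a low-rank delta, so swapping tasks means unmerging one delta and merging another into a single shared base weight. A toy sketch (sizes, adapter names, and the `swap_adapter` helper are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                             # hidden size, LoRA rank (r << d)

W_base = rng.standard_normal((d, d))     # shared base weight, stored once

# Two task adapters: each costs only d*r + r*d extra parameters on disk.
adapters = {
    "summarize": (0.01 * rng.standard_normal((d, r)),
                  0.01 * rng.standard_normal((r, d))),
    "translate": (0.01 * rng.standard_normal((d, r)),
                  0.01 * rng.standard_normal((r, d))),
}

W = W_base.copy()
active = None

def swap_adapter(name):
    """Unmerge the current low-rank delta and merge the requested one."""
    global active
    if active is not None:
        A, B = adapters[active]
        W[...] -= A @ B                  # undo the previous adapter in place
    A, B = adapters[name]
    W[...] += A @ B                      # apply the new adapter in place
    active = name

swap_adapter("summarize")
swap_adapter("translate")                # only one delta is ever merged at a time

A, B = adapters["translate"]
assert np.allclose(W, W_base + A @ B)
```

Storage stays at one full weight matrix plus a few rank-r factor pairs, and a swap costs one small matmul rather than a model reload.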
Forget ensembles and retraining: estimate LLM uncertainty with just a single forward-backward pass by assuming parameter covariance isotropy.
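Under an isotropic parameter covariance Σ = σ²I, the delta-method predictive variance collapses to σ²‖∇θf(x)‖², i.e. one backward pass per input. A toy linear-model sketch (the model, `sigma2`, and both inputs are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "model": scalar output f(x) = w . x (a stand-in for an LLM logit).
w = rng.standard_normal(8)

def forward(x):
    return w @ x

def backward(x):
    # d f / d w = x for this linear model; a real network would get this
    # vector from a single backward pass.
    return x

# Isotropy assumption: cov(w) = sigma2 * I, so the predictive variance is
# just sigma2 * ||grad||^2 -- no ensemble, no retraining.
sigma2 = 0.1

def predictive_variance(x):
    g = backward(x)
    return sigma2 * float(g @ g)

x_easy = 0.1 * np.ones(8)   # small gradient -> low uncertainty
x_hard = 2.0 * np.ones(8)   # large gradient -> high uncertainty
assert predictive_variance(x_hard) > predictive_variance(x_easy)
```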
You can shrink a spacecraft anomaly detection model by 97% and still catch almost all the problems.
You can shrink a privacy expert LLM by 4500x and still get human-level privacy judgments.
Commodity CPUs can be retrofitted with hardware-backed control flow attestation using hardware performance counters, enabling runtime attack detection in TEEs.
Formalizing speculative execution vulnerabilities with compositional semantics allows for automated detection and verification, moving beyond ad-hoc countermeasures.
LLM agents actually perform *better* when you strip away the majority of the boilerplate in their skill descriptions, suggesting current context windows are overloaded with irrelevant information.
Run code LLMs 10x faster and with 6x less memory on your laptop: Ditto compiles them into lean, mean, local executables.
Video Transformers can achieve near-full attention accuracy with significantly less compute by focusing only on informative vertical vectors.
LLMs can maintain conversational stability and improve retrieval accuracy in long-running interactions by adaptively compressing context, leading to reduced token usage and faster inference.
A novel data-dependency-free palette unlocks high-throughput, low-resource mezzanine coding, outperforming JPEG-XS while cutting LUT usage in half.
Semantic scene understanding can keep your robot from crashing when running LLMs on edge devices.
Achieve HPC acceleration by emulating FP64 operations with INT8 precision on GPUs, proving that you can boost performance *and* accuracy.
Turns out, almost half of AI assistant queries in software development are unnecessary, suggesting we're over-relying on these tools for tasks better suited to simpler solutions.
Scanning every token to focus attention is now passé: HISA prunes irrelevant context blocks *before* token-level scoring, slashing compute without sacrificing selection fidelity.
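A minimal sketch of block-then-token selection of this flavor: score the query against mean-pooled block summaries, keep only the top blocks, and run token-level attention inside the survivors. Block size, pooling, and the top-4 cutoff are illustrative assumptions, not HISA's actual design.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_blocks, block = 32, 16, 64
K = rng.standard_normal((n_blocks * block, d))   # keys for the full context
V = rng.standard_normal((n_blocks * block, d))   # values
q = rng.standard_normal(d)                       # one query

# Stage 1: coarse scoring against mean-pooled block summaries.
K_blocks = K.reshape(n_blocks, block, d).mean(axis=1)
block_scores = K_blocks @ q
keep = np.argsort(block_scores)[-4:]             # only the top-4 blocks survive

# Stage 2: token-level softmax attention inside surviving blocks only.
idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep])
scores = K[idx] @ q / np.sqrt(d)
w = np.exp(scores - scores.max())
w /= w.sum()
out = w @ V[idx]

assert out.shape == (d,)
```

Token-level scoring now touches 4 × 64 = 256 tokens instead of all 1024, which is where the compute savings come from.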
Generate or edit 1024x1024 images on your phone in under a second with DreamLite, a unified diffusion model that rivals server-side performance despite its tiny 0.39B parameters.
Stop handcuffing student diffusion models to their teachers: framing distribution matching as a reward unlocks more stable and performant distillation via RL techniques.
Guaranteeing robust distributed GenAI inference at the edge requires trust-aware routing, and G-TRAC achieves this with sub-millisecond routing latency.
Compressing 3D Gaussian Splatting just got a whole lot better: GeoHCC maintains geometric integrity and rendering fidelity by explicitly modeling inter-anchor geometric correlations, outperforming existing anchor-based approaches.
Runaway compute costs for diffusion models on GPUs? EdgeDiT slashes parameters by 30% and latency by 40% while maintaining image quality, all on your phone.
Forget pruning or quantization: MPO decomposition lets you compress a transformer by 13x while retaining 97% accuracy.
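An MPO (matrix product operator) compresses a weight matrix by regrouping its row/column indices and factoring with truncated SVDs. A two-core sketch on a synthetic weight with exact low MPO rank — the Kronecker construction and the 8×8 index split are assumptions chosen so the decomposition is exact:

```python
import numpy as np

rng = np.random.default_rng(4)

# A 64x64 weight with Kronecker (low MPO-rank) structure: W = kron(P, Q).
P, Q = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
W = np.kron(P, Q)

# Regroup indices W[i1*8+i2, j1*8+j2] -> M[(i1,j1), (i2,j2)].
T = W.reshape(8, 8, 8, 8).transpose(0, 2, 1, 3).reshape(64, 64)

# Truncated SVD gives the two MPO cores.
U, s, Vt = np.linalg.svd(T)
r = int(np.sum(s > 1e-10 * s[0]))            # numerical MPO rank (1 here)
core1 = U[:, :r] * s[:r]                     # shape (64, r)
core2 = Vt[:r, :]                            # shape (r, 64)

# Reconstruct and undo the index regrouping.
W_hat = (core1 @ core2).reshape(8, 8, 8, 8).transpose(0, 2, 1, 3).reshape(64, 64)

compression = W.size / (core1.size + core2.size)
assert np.allclose(W, W_hat)
assert r == 1 and compression == 32.0
```

Real transformer weights are not exactly low-rank in this sense, so practical MPO compression truncates `r` and trades a little accuracy for the parameter savings.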
You can cut 7-14% of the parameters from your SLAM-ASR system by pruning the Whisper encoder and applying LoRA, even outperforming the original model in some cases.
LVLM inference is ripe for optimization, but current acceleration techniques only scratch the surface.
LLMs fix more bugs when you feed them *less* code, thanks to a new compression technique that distills context to the minimal, crucial snippets.
Skipping frames without objects boosts nano-drone object detection throughput by 24% with negligible accuracy loss.
Achieve 7.7% better compression than JPEG-XL by using a bit-depth adaptive entropy model for lossless raw image compression.
Quantization-based point cloud compression can lead to severe distortions, but this work demonstrates a new leaf node lossy compression method that significantly outperforms existing octree-based approaches for object point clouds.
Achieve FP16-level LLM accuracy at 3-bit quantization, unlocking 1.5x faster inference than 4-bit methods on consumer GPUs.
Hadamard rotations unlock near-lossless 5-bit quantization for LLMs, outperforming standard techniques without calibration data.
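Why rotations help: a Hadamard transform spreads a few outlier channels across all coordinates, so a symmetric uniform quantizer wastes far less range. A sketch comparing 5-bit quantization with and without the rotation (the outlier pattern and `quant5` are illustrative, not the paper's exact scheme):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(5)
d = 256
w = rng.standard_normal(d)
w[:4] *= 50.0                      # a few outlier channels dominate the range

def quant5(v):
    """Symmetric 5-bit uniform quantization (integer levels in [-15, 15])."""
    scale = np.abs(v).max() / 15.0
    return np.round(v / scale).clip(-15, 15) * scale

H = hadamard(d) / np.sqrt(d)       # orthonormal rotation, needs no calibration

direct = quant5(w)
rotated = H.T @ quant5(H @ w)      # quantize in the rotated basis, rotate back

err_direct = np.mean((w - direct) ** 2)
err_rotated = np.mean((w - rotated) ** 2)
assert err_rotated < err_direct
```

Because the rotation is a fixed orthonormal matrix, it preserves the quantization error's norm while shrinking the dynamic range the quantizer has to cover — no calibration data required.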
By cleverly repurposing an unused sign bit, IF4 achieves superior quantization performance compared to NVFP4 without increasing bit-width.
Automating the messy process of post-training quantization, OneComp lets you compress generative AI models with a single line of code.
Forget slow rotations: IsoQuant's quaternion-based approach outpaces RotorQuant at LLM KV cache compression, delivering up to 6x speedups on synthetic data.
Achieve secure outsourced decision tree evaluation without any communication between servers, unlocking faster and more scalable MLaaS deployments.
A 50x speedup makes VLMs fast enough to serve as a real-time semantic safety net for self-driving cars, but NF4 quantization can cause critical recall failures.
StreamingVLA achieves a remarkable 2.4x speedup and 6.5x reduction in execution halting by asynchronously parallelizing observation, action generation, and execution stages in vision-language-action models.
LLM inference bottlenecks aren't just compute-bound: heterogeneous GPU-FPGA systems can slash memory processing overheads by up to 2x while simultaneously reducing energy consumption.
Deploying transformers in real-time just got a whole lot faster: this work achieves up to 64x speedups on GPUs while maintaining accuracy through a novel hybrid precision approach.
Forget GPU-centric All-Reduce: SCIN's switch-based architecture slashes latency by up to 8.7x and boosts LLaMA-2 performance by 34% through in-network quantization.
You can boost ranking model performance in low-traffic recommendation systems by directly distilling knowledge from a large-scale, but different, domain like video recommendations.
Cutting LLM costs and ensuring zero data leakage might be two sides of the same contextual compression coin.
Inference-time hacks to boost LLM reasoning are mostly a waste of time: raw model power matters way more.
Forget selecting or merging original KV pairs – KVSculpt distills the KV cache into a smaller, optimized representation in continuous embedding space, slashing KL divergence by up to 4.1x.
Apple's own vDSP FFT library gets smoked by a new implementation that's 29% faster, thanks to a clever two-tier memory model exploiting the GPU's register file and threadgroup memory.
Ternary LLMs can run up to 62x faster on CPU and 1.9x faster on CUDA with RSR-core, a new engine that finally brings theoretically fast low-bit matrix multiplication to practical hardware.
Multi-chiplet architectures can unlock significant speedups and memory savings for low-batch MoE inference by dynamically scheduling expert computations across high-bandwidth die-to-die links.
Forget generic pre-training: Speculative decoding gets a serious speed boost when your draft model is a specialist trained on data matching the target task.
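The reason a specialist draft helps: in the standard speculative accept/reject rule, the expected acceptance rate is 1 minus the total-variation distance between draft and target, so a draft trained on matching data gets more tokens accepted per target-model call. A toy sketch with categorical distributions standing in for next-token predictions (all distributions here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(6)
V = 32  # toy vocabulary size

def speculative_accept_rate(p_target, p_draft, n=20000):
    """Fraction of draft samples accepted by the standard accept/reject rule."""
    tokens = rng.choice(V, size=n, p=p_draft)
    u = rng.uniform(size=n)
    accept = u < np.minimum(1.0, p_target[tokens] / p_draft[tokens])
    return accept.mean()

p_target = rng.dirichlet(np.ones(V))          # the big model's distribution

# Generalist draft: roughly uniform, far from the target distribution.
p_general = rng.dirichlet(100.0 * np.ones(V))
# Specialist draft: trained on matching data, so close to the target.
p_special = 0.9 * p_target + 0.1 * p_general
p_special /= p_special.sum()

assert speculative_accept_rate(p_target, p_special) > \
       speculative_accept_rate(p_target, p_general)
```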
LLMs can maintain performance while processing longer contexts, thanks to a new compression method that intelligently adjusts the compression ratio based on the information density of the input.
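One simple proxy for "information density" is token entropy: low-entropy (repetitive) segments can be compressed hard while high-entropy segments keep most of their tokens. A sketch of that mapping — `keep_fraction`, the entropy cap, and the retention bounds are illustrative assumptions, not the paper's method:

```python
import numpy as np
from collections import Counter

def entropy(tokens):
    """Empirical Shannon entropy (bits) of a token sequence."""
    counts = np.array(list(Counter(tokens).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def keep_fraction(tokens, lo=0.2, hi=0.9, max_bits=6.0):
    """Map segment entropy to a retention ratio: dense segments keep more."""
    h = min(entropy(tokens), max_bits)
    return lo + (hi - lo) * h / max_bits

boilerplate = ["the"] * 40 + ["a"] * 10          # low entropy -> compress hard
dense = [f"tok{i}" for i in range(50)]           # high entropy -> keep most

assert keep_fraction(dense) > keep_fraction(boilerplate)
```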
Forget training on long videos – PackForcing achieves state-of-the-art long-video generation by cleverly compressing the KV-cache into Sink, Mid, and Recent tokens, enabling 24x temporal extrapolation from short-video training.