April 24 – May 1, 2026

Inference & Quantization - Weekly Roundup

100 papers published across 7 labs.

Selected Labs publishing this week

NVIDIA (2), Tsinghua AI (2), Apple ML (1), CMU ML (1), DeepMind (1)

Top Papers

Apr 27, 2026

Iizalaarab Elhaimeur +3·3w ago

Latency and Cost of Multi-Agent Intelligent Tutoring at Scale

Multi-agent LLM systems can maintain sub-4-second response times even under classroom-scale concurrency, but only with the right throughput tier.

Iizalaarab Elhaimeur, Iizalaarab Elhaimeur, Nikos Chrisochoides +1

Distributed Systems & Hardware Inference & Quantization Tool Use & Agents

Apr 30, 2026

Daniel Waxman +5·3w ago

Sequential Inference for Gaussian Processes: A Signal Processing Perspective

Signal processing practitioners gain a coherent roadmap for deploying sequential Gaussian Processes in real-world systems, bridging the gap between ML advances and practical application.

Daniel Waxman, Daniel Waxman, Fernando Llorente +3

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

University of Pisa & ISTI–CNR·3w ago·also ISTI–CNR, University of Pisa

Efficient Multivector Retrieval with Token-Aware Clustering and Hierarchical Indexing

Token-aware clustering and hierarchical indexing can slash retrieval latency by an order of magnitude without sacrificing accuracy, making multivector retrieval practical at scale.

Silvio Martinico, Silvio Martinico, Franco Maria Nardini +5

Inference & Quantization Natural Language Processing Recommendation & Information Retrieval

3w ago·also BU, Cornell, NTT Physics and Informatics Laboratories

Physical Foundation Models: Fixed hardware implementations of large-scale neural networks

Forget chasing bigger GPUs – the future of AI inference could be literally baked into the hardware itself, unlocking 1000x gains in energy and speed.

Logan G. Wright, Tianyu Wang, Tatsuhiro Onodera +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Xubin Luo +1·3w ago

AI Inference as Relocatable Electricity Demand: A Latency-Constrained Energy-Geography Framework

Unlocking the energy-latency frontier reveals how much cheaper and greener AI inference could be if we strategically relocate computation based on latency tolerance.

Xubin Luo, Yang Cheng

Distributed Systems & Hardware Inference & Quantization

All Papers (100)

Apr 30, 2026

Daniel Waxman +5·3w ago

Sequential Inference for Gaussian Processes: A Signal Processing Perspective

Signal processing practitioners gain a coherent roadmap for deploying sequential Gaussian Processes in real-world systems, bridging the gap between ML advances and practical application.

Daniel Waxman, Daniel Waxman, Fernando Llorente +3

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

University of Pisa & ISTI–CNR·3w ago·also ISTI–CNR, University of Pisa

Efficient Multivector Retrieval with Token-Aware Clustering and Hierarchical Indexing

Token-aware clustering and hierarchical indexing can slash retrieval latency by an order of magnitude without sacrificing accuracy, making multivector retrieval practical at scale.

Silvio Martinico, Silvio Martinico, Franco Maria Nardini +5

Inference & Quantization Natural Language Processing Recommendation & Information Retrieval

3w ago·also BU, Cornell, NTT Physics and Informatics Laboratories

Physical Foundation Models: Fixed hardware implementations of large-scale neural networks

Forget chasing bigger GPUs – the future of AI inference could be literally baked into the hardware itself, unlocking 1000x gains in energy and speed.

Logan G. Wright, Tianyu Wang, Tatsuhiro Onodera +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Xubin Luo +1·3w ago

AI Inference as Relocatable Electricity Demand: A Latency-Constrained Energy-Geography Framework

Unlocking the energy-latency frontier reveals how much cheaper and greener AI inference could be if we strategically relocate computation based on latency tolerance.

Xubin Luo, Yang Cheng

Distributed Systems & Hardware Inference & Quantization

Zhongguancun Academy·3w ago·also USTC

Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation

LLMs can generate recommendations up to 3.1x faster by explicitly modeling token position within items and speculation depth during speculative decoding.

Jiaju Chen, Chongming Gao, Chenxiao Fan +4

Inference & Quantization Natural Language Processing Recommendation & Information Retrieval

Menglin Deng +19·3w ago·also Fudan, RUYi Dynamics Co

EdgeFM: Efficient Edge Inference for Vision-Language Models

EdgeFM delivers production-grade VLM/LLM inference performance on edge devices, outperforming vendor-specific toolchains by up to 49% while remaining open-source and cross-platform.

Menglin Deng, Mengling Deng, Yuanpeng Chen +17

Computer Vision Inference & Quantization Multimodal Models

Nankai University·3w ago·also Huawei

YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal

Achieve up to 2.5x faster video object removal by focusing DiT computations only on the essential tokens dictated by the mask.

Chenyang Wu, Lina Lei, Fan Li +6

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

Muhammad Ihsan Al Hafiz +3·3w ago

NeuroRing: Scaling Spiking Neural Networks via Multi-FPGA Bidirectional Ring Topologies and Stream-Dataflow Architectures

NeuroRing achieves faster-than-real-time execution of a full-scale cortical microcircuit simulation on FPGAs, proving that scalable, energy-efficient SNN hardware is within reach.

Muhammad Ihsan Al Hafiz, Muhammad Ihsan Al Hafiz, Artur Podobas +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

3w ago·also Samsung Electronics

AME-PIM: Can Memory be Your Next Tensor Accelerator?

HBM-PIM can achieve impressive matrix multiplication throughput (14.9 GFLOP/s) using a novel reduction-free outer-product dataflow, even without native reduction support.

Emanuele Venieri, Simone Manoni, Alberto Florian +3

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Jisheng Zhao +4·3w ago

CuLifter: Lifting GPU Binaries to Typed IR

Recovering type information from untyped GPU register files is the key to enabling effective binary analysis, unlocking reverse engineering and security analysis of proprietary GPU code.

Jisheng Zhao, Huanzhi Pu, Shinnung Jeong +2

Code Generation & Program Synthesis Distributed Systems & Hardware Inference & Quantization

Yan-Cheng Guo +2·3w ago

RCW-CIM: A Digital CIM-based LLM Accelerator with Read-Compute/Write

Forget waiting – this new CIM architecture slashes LLM weight update latency by up to 87%, unlocking faster prefill and decoding.

Yan-Cheng Guo, Tian-Sheuan Chang, Jian-Wei Su

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Zimiao Lin +2·3w ago

VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling

Ternary LLMs can achieve impressive throughput and energy efficiency on edge devices, thanks to VitaLLM's co-designed hardware acceleration that overcomes workload imbalance and data dependency challenges.

Zimiao Lin, Zi-Wei Lin, Tian-Sheuan Chang

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Xiumei Li +4·3w ago

TAFA-GSGC: Group-wise Scalable Point Cloud Geometry Compression with Progressive Residual Refinement

Unlock bandwidth-adaptive point cloud transmission with TAFA-GSGC, a single-model codec that delivers up to 9 quality levels from a single bitstream.

Xiumei Li, Alexander Kopte, Alexander Kopte +2

Computer Vision Inference & Quantization

Yanting Wang +3·3w ago

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Red-teaming long-context LLMs just got a whole lot cheaper: FlashRT slashes the compute and memory costs of prompt injection attacks by up to 7x.

Yanting Wang, Chenlong Yin, Ying Chen +1

Inference & Quantization Red-Teaming & Adversarial Robustness Training Efficiency & Optimization

Junqi Gao +9·3w ago

Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression

Forget storing full task-specific models – Auto-FlexSwitch compresses the knowledge into tiny, dynamically assembled task vectors, slashing storage costs without sacrificing accuracy.

Junqi Gao, Junqi Gao, Dazhi Zhang +7

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

3w ago·also INRIA

Strait: Perceiving Priority and Interference in ML Inference Serving

Juggling high-priority and low-priority ML inference requests on GPUs? Strait delivers up to 11% fewer missed deadlines for critical tasks.

Haidong Zhao, Nikolaos Georgantas, Nikolaos Georgantas

Distributed Systems & Hardware Inference & Quantization

Yanwu Gu +2·3w ago

Prediction-powered Inference by Mixture of Experts

Combining diverse AI prediction tools as a Mixture of Experts slashes variance in semi-supervised inference, outperforming standard Prediction-Powered Inference.

Yanwu Gu, Linglong Kong, Dong Xia

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

3w ago

Diffusion-OAMP for Joint Image Compression and Wireless Transmission

Ditch the training data: this method uses a pre-trained diffusion model to jointly compress and transmit images, outperforming classic techniques without any task-specific training.

Wentao Hou, W. Hou, Yiming Bai +4

Computer Vision Inference & Quantization

Vishnuprasadh Kumaravelu +3·3w ago·also IIT

Post-Optimization Adaptive Rank Allocation for LoRA

Get 4x-10x smaller LoRA models for free with a simple post-processing step that doesn't hurt performance.

Vishnuprasadh Kumaravelu, Sunil Gupta, P. K. Srijith +1

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

Wei Li +6·3w ago·also Guangdong AIHISUN Technology Co.

Robust Lightweight Crack Classification for Real-Time UAV Bridge Inspection

You can now get real-time (825 FPS) crack detection on UAVs without sacrificing accuracy, thanks to a new attention-enhanced lightweight CNN.

Wei Li, Haisheng Li, Weijie Li +4

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

Wenxiang Lin +5·3w ago·also HIT

ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training

LLM training bottlenecks? ZipCCL achieves up to 1.18x end-to-end speedups by losslessly compressing communication collectives, without sacrificing model quality.

Wenxiang Lin, Xinglin Pan, Ruibo Fan +3

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Wei Cheng +6·3w ago

To Diff or Not to Diff? Structure-Aware and Adaptive Output Formats for Efficient LLM-based Code Editing

LLMs can edit code 30% faster and cheaper without sacrificing accuracy, simply by learning to choose between generating full code and structure-aware diffs.

Wei Cheng, Yongchang Cao, Chen Shen +4

Code Generation & Program Synthesis Inference & Quantization Training Efficiency & Optimization

Apr 29, 2026

3w ago

FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving

Slash MoE serving costs by two-thirds with FaaSMoE, a serverless architecture that dynamically scales experts on demand.

Minghe Wang, Trever Schirmer, Mohammadreza Malekabbasi +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Barcelona Supercomputing Center·3w ago·also UPC

A Semantic Quantum Circuit Cache for Scalable and Distributed Quantum-Classical Workflows

Stop recomputing the same quantum circuits: a semantic cache slashes redundant simulations by up to 92% and speeds up real quantum hardware by 11x.

Mar Tejedor, Javier Conejero, Rosa M. Badia

Distributed Systems & Hardware Inference & Quantization

3w ago·also Open-EP Community

Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models

Achieve faster VLM inference in bandwidth-constrained edge environments by adaptively compressing visual data, outperforming full-edge and full-cloud solutions without sacrificing semantic accuracy.

Cyril Shih-Huan Hsu, Wig Yuan-Cheng Cheng, Chrysa Papagianni

Computer Vision Inference & Quantization Multimodal Models

Hyunsung Yoon +3·3w ago

Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators

Dense matrix multiplication accelerators can surprisingly outperform dedicated sparse accelerators for sparse neural networks, offering better area and energy efficiency.

Hyunsung Yoon, Sungju Ryu, Sungju Ryu +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

3w ago·also Southwestern University of Finance and Economics

CARD: Non-Uniform Quantization of Visual Semantic Unit for Generative Recommendation

Skewed item distributions in recommendation systems can be tamed with a learnable non-uniform quantization, leading to better codebook utilization and more accurate generative recommendations.

Yibiao Wei, Jie Zou, Pengfei Zhang +4

Inference & Quantization Multimodal Models Recommendation & Information Retrieval

3w ago

LLM-Guided Runtime Parameter Optimization for Energy-Efficient Model Inference

Forget grid search: LLMs can rapidly find energy-efficient inference parameters, outperforming traditional optimization methods with just a few human-guided prompts.

Katelyn Crumpacker, Dimitrios Nikolopoulos

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Verint Systems Inc·3w ago

When Your LLM Reaches End-of-Life: A Framework for Confident Model Migration in Production Systems

Stop sweating LLM migrations: this Bayesian framework lets you confidently swap models in production, even with limited human evals.

Emma Casey, David Roberts, David Sim +1

Distributed Systems & Hardware Eval Frameworks & Benchmarks Inference & Quantization

3w ago

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

Shrinking diffusion LLMs by distilling across different architectures can yield surprisingly strong performance, even boosting code generation scores by 16 points on HumanEval.

Gongbo Zhang, Wen Wang, Ye Tian +1

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

Apple ML·3w ago·also CMU ML, UCSB

Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling

Forget coarse sequence-level hacks: LenVM lets you precisely dial in token generation length, boosting a 7B model's length accuracy from 30.9 to 64.8 and crushing closed-source rivals.

Zhen Zhang, Changyi Yang, Zijie Xia +13

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Natural Language Processing

3w ago

Co-Evolving Policy Distillation

By co-evolving experts through bidirectional policy distillation, CoPD achieves all-in-one integration of text, image, and video reasoning, outperforming domain-specific experts and suggesting a new training paradigm.

Naibin Gu, Chenxu Yang, Qingyi Si +7

Inference & Quantization Training Efficiency & Optimization

Jinbiao Wei +4·3w ago

Step-level Optimization for Efficient Computer-use Agents

Frontier models are wasted on routine GUI tasks: a step-level cascade that adaptively invokes stronger models only when lightweight monitors detect progress stalls or semantic drift slashes compute costs without sacrificing performance.

Jinbiao Wei, Kangqi Ni, Yilun Zhao +2

Inference & Quantization Tool Use & Agents Training Efficiency & Optimization

NVIDIA·3w ago

Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding

Speculative decoding, typically used post-RL, can be integrated directly into RL training loops to accelerate LLM rollout generation by up to 2.5x.

Hayate Iso, Tiyasa Mitra, Sudipta Mondal +22

Distributed Systems & Hardware Inference & Quantization RLHF & Preference Learning +1

Akshay Karjol +1·3w ago

Edge AI for Automotive Vulnerable Road User Safety: Deployable Detection via Knowledge Distillation

Quantization crushes large object detection models for edge deployment, but knowledge distillation can resurrect them, even surpassing their original floating-point precision in a much smaller package.

Akshay Karjol, Darrin M. Hanna

Computer Vision Inference & Quantization Training Efficiency & Optimization

Yiqi Liu +4·3w ago

Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel

Forget brute-force scaling: smarter tile and tensor mapping on 3D-stacked chips could unlock massive LLM inference gains.

Yiqi Liu, Noelle Crawford, Michael Wang +2

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Bodon Jeong +8·3w ago

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

Edge LLM inference gets a serious speed boost: DUAL-BLADE's dual-path KV cache slashes latency by up to 42% and doubles SSD utilization.

Bodon Jeong, Bodon Jeong, Hongsu Byun +6

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Aditya Ukarande +7·3w ago

Efficient, VRAM-Constrained xLM Inference on Clients

Squeezing high-accuracy LLMs and VLMs onto client devices is now significantly more feasible, thanks to a new pipelined sharding technique that achieves up to 30x speedups and 10x VRAM reduction.

Aditya Ukarande, Aditya Ukarande, Deep Shekhar +5

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization +1

3w ago·also Vrije Universiteit Amsterdam

What Is the Cost of Energy Monitoring? An Empirical Study on the Overhead of RAPL-Based Tools

Naive RAPL-based energy monitoring can add nearly 50% overhead to your measurements, but optimized tools can keep it negligible.

Jeremy Diamond, Vincenzo Stoico

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Tsinghua AI·3w ago·also Tencent AI

SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding

Semantic priors in neural speech codecs hit a wall: their benefits plateau beyond 6 kbps, revealing a fundamental limit to improving intelligibility at higher bitrates.

Mingyu Zhao, Zijian Lin, Kun Wei +2

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Natural Language Processing +1

3w ago

Efficient Listwise Reranking with Compressed Document Representations

Forget slow reranking: this new method compresses documents into embeddings, letting an 8B parameter model run up to 18x faster than smaller models with better accuracy.

Hervé Déjean, Stéphane Clinchant +1

Inference & Quantization Natural Language Processing Recommendation & Information Retrieval

Wenxuan Ye +4·3w ago

Select to Think: Unlocking SLM Potential with Local Sufficiency

SLMs can match the reasoning performance of much larger models by simply re-ranking their own top-K token predictions, eliminating the need for expensive LLM calls at inference time.

Wenxuan Ye, Yangyang Zhang, Xueli An +2

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Reasoning & Chain-of-Thought

M. K. Khalidi Siam +7·3w ago

Exploring the Limits of Pruning: Task-Specific Neurons, Model Collapse, and Recovery in Task-Specific Large Language Models

Task-specific LLMs aren't just smaller versions of general models; they rely on a small subset of neurons so critical that removing just 10% can completely break them.

M. K. Khalidi Siam, Md. Tausif-Ul-Islam, Md. Reshad Romim Khan +5

Code Generation & Program Synthesis Inference & Quantization Reasoning & Chain-of-Thought

IHP - Leibniz-Institut für innovative·3w ago·also BTU Cottbus-Senftenberg

Preventing Distinguishability between Multiplication and Squaring Operations

Your ECC implementation might be leaking secrets through power consumption differences between multiplication and squaring, regardless of your multiplication algorithm.

Alkistis Aikaterini Sigourou, Zoya Dyka, Peter Langendoerfer +1

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization

3w ago·also DeepMind, AI Sequrity Company

Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

Dynamic quantization, a widely adopted optimization for efficient ML serving, can leak your data to adversaries sharing the same batch.

Hanna Foerster, Ilia Shumailov, Yiren Zhao +2

Inference & Quantization Training Efficiency & Optimization

3w ago·also SMU

An Empirical Study of Speculative Decoding on Software Engineering Tasks

Smaller models get a bigger speed boost from Speculative Decoding on software engineering tasks, challenging the assumption that larger models always benefit more from inference acceleration techniques.

Yijia Li, Junkai Chen, Xing Hu +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Inference & Quantization

Shuzhao Xie +12·3w ago·also Shenzhen MSU-BIT University

MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching

Achieve 34x compression of 3D Gaussian Splatting models *without* sacrificing rendering quality, and sometimes even improving it.

Shuzhao Xie, Junchen Ge, Weixiang Zhang +10

Computer Vision Inference & Quantization

Apr 28, 2026

Tri-Nhan Vo +3·3w ago

Diverse Image Priors for Black-box Data-free Knowledge Distillation

Black-box knowledge distillation can be significantly improved by synthesizing diverse image priors and using contrastive learning to enhance the distinctions between synthetic samples.

Tri-Nhan Vo, Dang Nguyen, Trung Le +1

Computer Vision Data Curation & Synthetic Data Inference & Quantization

Ajmain Inqiad Alam +4·3w ago

Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models

Slash your LLM's carbon footprint by up to 81% without sacrificing performance using a compression pipeline inspired by carbon taxation.

Ajmain Inqiad Alam, Palash Roy, Chanchal K. Roy +2

Code Generation & Program Synthesis Inference & Quantization Training Efficiency & Optimization

Tri-Nhan Vo +2·3w ago

Improving Diversity in Black-box Few-shot Knowledge Distillation

Augmenting few-shot knowledge distillation with adaptively selected, teacher-confident GAN-generated images dramatically boosts student accuracy.

Tri-Nhan Vo, Dang Nguyen, Kien Do

Computer Vision Inference & Quantization Training Efficiency & Optimization

Chayanon Kitkana +1·3w ago

Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment

Even when you think you're only teaching a model what *not* to do, sustained gradient alignment can lead to the unintended acquisition of undesirable traits.

Chayanon Kitkana, Shivam Arora

Inference & Quantization Training Efficiency & Optimization

3w ago

Enhancing SignSGD: Small-Batch Convergence Analysis and a Hybrid Switching Strategy

SignSGD can beat Adam and even SGD with a few simple tweaks, proving that 1-bit quantization doesn't have to mean sacrificing accuracy.

Haoran Chen, Wentao Wang

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Changyu Li +7·3w ago

FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices

Stop wasting bandwidth on irrelevant tokens: Fed-FSTQ uses Fisher information to selectively quantize and transmit only the most important tokens, slashing communication costs in federated LLM fine-tuning by up to 46x.

Changyu Li, Shuanghong Huang, Jiashen Liu +5

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Sehyeon Oh +2·3w ago

QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer Attention

Integer-only attention is now a viable alternative to floating-point, delivering up to 8.69x speedups and 18.8% energy reduction on Vision Transformers.

Sehyeon Oh, Yongin Kwon, Jemin Lee

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

Josué Obregon·3w ago

RCProb: Probabilistic Rule Extraction for Efficient Simplification of Tree Ensembles

Rule extraction from tree ensembles just got 22x faster, without sacrificing accuracy or interpretability.

Josué Obregon

Inference & Quantization Interpretability & Mechanistic Interp

Ocean Monjur +2·3w ago

Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

Unstructured pruning isn't just about shrinking LLMs; it can actually *boost* their reasoning abilities during test-time scaling, outperforming even the full, unpruned models.

Ocean Monjur, Shahriar Kabir Nahin, Anshuman Chhabra

Inference & Quantization Reasoning & Chain-of-Thought Scaling Laws & Emergent Abilities

Wenshuo Wang·3w ago

Knowledge Distillation Must Account for What It Loses

Distilling large models into smaller ones can silently sacrifice crucial capabilities like safety and uncertainty awareness, even if headline metrics stay the same.

Wenshuo Wang

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Inference & Quantization

3w ago

Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study

Compound AI systems can achieve nearly 4x throughput improvement and cut tail latency in half with a modular, autoscaling inference architecture.

Srikanta Prasad S, Utkarsh Arora

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

3w ago·also Aleph Alpha, Bosch Center for Artificial Intelligence

The Surprising Effectiveness of Canonical Knowledge Distillation for Semantic Segmentation

Forget fancy distillation losses: simple feature-based knowledge distillation, given enough compute, lets a ResNet-18 student nearly match a ResNet-101 teacher in semantic segmentation.

Muhammad Ali, Kevin Alexander Laube, Madan Ravi Ganesh +3

Computer Vision Inference & Quantization Training Efficiency & Optimization

Changyu Li +6·3w ago

PI-TTA: Physics-Informed Source-Free Test-Time Adaptation for Robust Human Activity Recognition on Mobile Devices

By injecting basic physics, this method achieves up to 9% accuracy gains in human activity recognition, proving that inductive biases still matter for real-world sensor data.

Changyu Li, Lu Wang, Ming Lei +4

Inference & Quantization Training Efficiency & Optimization

Evolutionairy AI·3w ago

The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive

LLMs from different vendors and sizes secretly speak the same statistical language, enabling a blazing-fast, model-agnostic output verification method.

Alex Bogdan, Adrian de Valois-Franklin

Eval Frameworks & Benchmarks Inference & Quantization Scaling Laws & Emergent Abilities

Dewei Bai +6·3w ago

QB-LIF: Learnable-Scale Quantized Burst Neurons for Efficient SNNs

SNNs can achieve higher accuracy and lower latency by learning the optimal spiking resolution for each layer, rather than relying on predefined burst structures.

Dewei Bai, Hongxiang Peng, Jiajun Mei +4

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

Dewei Bai +5·3w ago·also UC Santa Cruz

Vision SmolMamba: Spike-Guided Token Pruning for Energy-Efficient Spiking State-Space Vision Models

By intelligently pruning tokens based on spike timing and activation, Vision SmolMamba achieves state-of-the-art efficiency in spiking neural networks, outperforming even Spiking Mamba.

Dewei Bai, Hongxiang Peng, Yunyun Zeng +3

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

Tensor AI Solutions GmbH·3w ago·also DLR, Hensoldt Sensors GmbH, Ulm University

Quantum-Inspired Robust and Scalable SAR Object Classification

Tensor networks offer a surprisingly robust and efficient alternative to traditional neural networks for classifying noisy SAR imagery, even under data poisoning attacks.

Maximilian Scharf, Marco Trenti, Felix Bock +5

Computer Vision Inference & Quantization Red-Teaming & Adversarial Robustness

Oliver Bause +2·3w ago

Image Compression with Bubble-Aware Frame Rate Adaptation for Energy-Efficient Video Capsule Endoscopy

Diagnose more, charge less: a new VCE pipeline slashes energy consumption by 40% by intelligently skipping bubble-filled frames without sacrificing diagnostic quality.

Oliver Bause, Jorg Gammerdinger, Julia Werner

Computer Vision Inference & Quantization

Yun Li +1·3w ago

Edge-Cloud Collaborative Reconstruction via Structure-Aware Latent Diffusion for Downstream Remote Sensing Perception

Overcome the bandwidth bottleneck in remote sensing with a collaborative edge-cloud approach that transmits structural priors, enabling high-fidelity super-resolution and boosting downstream perception tasks even under extreme compression.

Yun Li, Xianju Li

Computer Vision Inference & Quantization

Zhouzhi Xiong +4·3w ago

DenseScout: Algorithm-System Co-design for Budgeted Tiny Object Selection on Edge Platforms

Prioritizing tiny objects on edge devices isn't just about detector accuracy; DenseScout shows that a lightweight, dense-response selector coupled with transport-aware runtime can drastically outperform traditional detectors under strict compute and latency budgets.

Zhouzhi Xiong, Zimou Zeng, Shu Xu +2

Computer Vision Distributed Systems & Hardware Inference & Quantization

Ce Zheng +6·3w ago·also Pengcheng Laboratory

SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission

Federated LLM inference gets a speed boost: SpecFed's speculative decoding and compressed communication slash latency without sacrificing generation quality.

Ce Zheng, Xinghan Wang, Jiahong Ning +4

Distributed Systems & Hardware Inference & Quantization

Zihao Xuan +6·3w ago

FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture

FusionCIM slashes LLM inference energy costs by nearly 4x while doubling processing speed, setting a new benchmark for efficiency in AI hardware.

Zihao Xuan, Jia Chen, Yewen Li +4

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Baijun Tan +1·3w ago

Lightweight Real-Time Rendering Parameter Optimization via XGBoost-Driven Lookup Tables

Achieve up to 70% faster rendering by distilling XGBoost models into lookup tables that adapt rendering parameters on a per-frame basis with sub-millisecond latency.

Baijun Tan, Francesco Moretti

Inference & Quantization Training Efficiency & Optimization

Zeyue Xue +11·3w ago

A Systematic Post-Train Framework for Video Generation

Unlock the full potential of your pretrained video diffusion models with a surprisingly simple four-stage post-training framework that drastically improves visual quality, temporal coherence, and instruction following.

Zeyue Xue, Siming Fu, Jie Huang +9

Computer Vision Inference & Quantization Training Efficiency & Optimization

Shouxu Lin +2·3w ago

DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference

Forget prefetching: DAK unlocks up to 3x faster LLM inference by enabling direct GPU access to remote memory, achieving near-optimal system bandwidth utilization.

Shouxu Lin, Zhiyuan Guo, Jiaxin Lin

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

DAMO·3w ago·also BAIR, Tsinghua AI, Intel Labs, Rice

Pythia: Toward Predictability-Driven Agent-Native LLM Serving

Multi-agent LLM systems are leaving performance on the table by treating structured agent interactions as generic traffic; Pythia shows how to unlock substantial gains by exploiting workflow semantics at the serving layer.

Xin Jin, Xuanzhe Liu

Distributed Systems & Hardware Inference & Quantization Tool Use & Agents

Vyom Sharma +1 · 3w ago

RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

Stop leaving 10-70% of your MoE kernel throughput on the table: RaMP dynamically optimizes kernel configuration based on runtime expert routing, achieving up to 1.41x end-to-end speedup in vLLM serving.

Vyom Sharma, Debajyoti Datta

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Ma Zirui +7 · 3w ago

AHASD: Asynchronous Heterogeneous Architecture for LLM Adaptive Drafting Speculative Decoding on Mobile Devices

Mobile LLM inference just got a whole lot faster: AHASD achieves up to 4.2x throughput and 5.6x energy efficiency gains by intelligently decoupling and managing drafting and verification tasks on a PIM-NPU architecture.

Ma Zirui, Zhihua Fan, Wenxing Li +5

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Smart Sensors Group at Hamburg · 3w ago · also Hamburg University of Technology

At the Edge of the Heart: ULP FPGA-Based CNN for On-Device Cardiac Feature Extraction in Smart Health Sensors for Astronauts

On-device cardiac monitoring is now feasible on ultra-low-power wearables, achieving 98% accuracy at just 8.55mW.

Kazi Mohammad Abidur Rahman, Davis Rakhshan, Philipp Lutke +3

Architecture Design (Transformers, SSMs, MoE) Computer Vision Distributed Systems & Hardware +1

Robin Geens +3 · 3w ago

Hardware Generation and Exploration of Lookup Table-Based Accelerators for 1.58-bit LLM Inference

LUT-based hardware architectures can achieve up to 2.2x area reduction for LLM inference by challenging conventional design assumptions and optimizing for activation data types.

Robin Geens, Joran Heldens, Joren Dumoulin +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Mingbo Hao +7 · 3w ago · also SEU

NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference

Forget GPUs – NVLLM's 3D NAND-centric design slashes LLM inference latency by up to 37.9x on edge devices, making on-device LLMs a real possibility.

Mingbo Hao, Changwei Yan, Haoyu Cui +5

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Jangho Baik +4 · 3w ago

RecFlash: Fast Recommendation System on In-Storage Computing with Frequency-Based Data Mapping

RecFlash slashes recommendation inference latency by up to 81% and energy consumption by nearly 92% through smart data remapping in NAND flash memory.

Jangho Baik, Sunghyun Kim, Gisan Ji +2

Distributed Systems & Hardware Inference & Quantization Recommendation & Information Retrieval

Qazvin Islamic Azad University · 3w ago · also Islamic Azad University

WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

WhisperPipe achieves 3-5x lower latency than existing streaming ASR solutions while consuming significantly less memory, making it a game-changer for real-time applications.

Erfan Ramezani, E. Ramezani, Mohammad Mahdi Giahi +4

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Speech & Audio

3w ago · also NVIDIA, Columbia, Samsung Semiconductor, Yonsei

AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

Forget GPU-centric designs: AMMA slashes attention latency by 15x and energy consumption by 7x with a memory-centric architecture for long-context LLMs.

Zhongkai Yu, Haotian Ye, Haotian Ye +12

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Ke Dong +3 · 3w ago

TetrisG-SDK: Efficient Convolutional Layer Mapping with Adaptive Windows and Grouped Convolutions for Fast In-Memory Computing

TetrisG-SDK achieves up to 1.3x faster convolutional layer processing while slashing energy consumption by over 70% in some cases.

Ke Dong, Kejie Huang, Tao Luo +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Sean Nian +4 · 3w ago

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

CacheFlow slashes LLM serving latency by up to 62% by rethinking KV cache restoration as a 3D-parallel scheduling problem, not just a recompute vs. I/O tradeoff.

Sean Nian, Jiahao Fang, Qilong Feng +2

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Apr 27, 2026

Yuanhao Zeng +6 · 3w ago · also ShanghaiTech University

Large Language Models Explore by Latent Distilling

Unlock more diverse and effective LLM outputs by explicitly rewarding semantic novelty during decoding with Exploratory Sampling.

Yuanhao Zeng, Ao Lu, Lu Li +4

Inference & Quantization Natural Language Processing

Christian Lysenstoen · 3w ago

Feasible-First Exploration for Constrained ML Deployment Optimization in Crash-Prone Hierarchical Search Spaces

Standard black-box optimization falls apart when deploying ML models under tight constraints in crash-prone environments; TBA offers a robust, feasible-first alternative that actually works.

Christian Lysenstoen

Inference & Quantization Training Efficiency & Optimization

Alex Bienstock +7 · 3w ago

Scalable Secure Biometric Authentication without Auxiliary Identifiers

Finally, a practical biometric authentication system offers provable security against large-scale data breaches without sacrificing scalability or requiring auxiliary identifiers.

Alex Bienstock, Daniel Escudero, Antigoni Polychroniadou +5

Distributed Systems & Hardware Inference & Quantization

Zeyu Bai · 3w ago

Spark Policy Toolkit: Semantic Contracts and Scalable Execution for Policy Learning in Spark

Spark Policy Toolkit unlocks scalable policy learning in Spark by guaranteeing consistent results even with distributed execution, finally making it possible to apply complex policy learning techniques to large datasets.

Zeyu Bai

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Independent Researcher · 3w ago

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference

Squeeze your LLM inference costs: PolyKV slashes KV cache memory by up to 97% using a shared, compressed pool, with negligible impact on quality.

Ishan Patel, Ishan Patel, Ishan Joshi +1

Distributed Systems & Hardware Inference & Quantization

Minkyu Kim +7 · 3w ago

Rethinking Layer Redundancy in Large Language Models: Calibration Objectives and Search for Depth Pruning

The secret to effectively pruning LLMs might not be *how* you search for redundant layers, but *what* you're optimizing for.

Minkyu Kim, Vincent-Daniel Yun, Youngrae Kim +5

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

Miao Lin +4 · 3w ago

Laplace-Bridged Randomized Smoothing for Fast Certified Robustness

Edge devices can now achieve up to 494x faster certified robustness with Laplace-Bridged Smoothing, making formally verified AI deployments practical in resource-constrained settings.

Miao Lin, MD Saifur Rahman Mazumder, Fengyi Yu +2

Inference & Quantization Red-Teaming & Adversarial Robustness Training Efficiency & Optimization

3w ago

Compute Aligned Training: Optimizing for Test Time Inference

Training LLMs to explicitly optimize for how they're *actually* used at inference time unlocks substantial performance gains compared to standard fine-tuning.

Adam Ousherovitch, Ambuj Tewari

Inference & Quantization RLHF & Preference Learning Training Efficiency & Optimization

Sagnik Chatterjee +2 · 3w ago

Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs

Small language models can achieve reasoning performance rivaling larger models, even under tight token budgets, by using a lightweight "guidance track" to strategically prune and refine their chain-of-thought reasoning.

Sagnik Chatterjee, Atharva Patil, S. Ramesh

Inference & Quantization Natural Language Processing Reasoning & Chain-of-Thought

Ruhr University Bochum · 3w ago

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

Not all layers are created equal: pruning the KV cache in a layer-dependent manner significantly boosts long-context LLM performance compared to uniform pruning strategies.

Zahra Dehghanighobadi, Asja Fischer

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

William Oliveira · 3w ago

Less Is More: Engineering Challenges of On-Device Small Language Model Integration in a Mobile Application

On-device SLMs in mobile apps demand a radical shift: the less the LLM does, the more reliable it becomes.

William Oliveira

Inference & Quantization Natural Language Processing Open-Source Models & Weights

Iizalaarab Elhaimeur +3 · 3w ago

Latency and Cost of Multi-Agent Intelligent Tutoring at Scale

Multi-agent LLM systems can maintain sub-4-second response times even under classroom-scale concurrency, but only with the right throughput tier.

Iizalaarab Elhaimeur, Iizalaarab Elhaimeur, Nikos Chrisochoides +1

Distributed Systems & Hardware Inference & Quantization Tool Use & Agents

3w ago

Network Impact of Post-Quantum Certificate Chain sizes on Time to First Byte in TLS Deployments

Quantum-safe certificates bloat TLS handshakes so much that they measurably degrade web performance, and current CDN optimizations aren't enough to fully compensate.

Matthew Chou, Phuong M Cao

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Verdict Security · 3w ago · also Ain Shams University

Machine-Checked Cardinality Bounds for Masked Barrett Reduction: A 1-Bit Side-Channel Leakage Barrier in Post-Quantum Cryptographic Hardware

Forget complex side-channel analysis: a single, machine-checked theorem proves that masked Barrett reduction leaks at most *one bit* of information per wire, offering a universal security guarantee for post-quantum crypto.

Ray Iskander, Khaled Kirah

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

School of Cyber Science and Technology · 3w ago

Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing

Backdoor attacks in LLMs can be defused at inference time, without retraining or external data, by geometrically smoothing attention patterns to disrupt adversarial routing.

Kaisheng Fan, Weizhe Zhang, Yishu Gao +2

Inference & Quantization Natural Language Processing Red-Teaming & Adversarial Robustness

Zihao Zheng +9 · 3w ago

FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching

Frequency-domain analysis unlocks 1.59x speedups in Vision-Language Navigation by enabling optimal token caching, something visual-domain approaches could not achieve.

Zihao Zheng, Xingyu Zhou, Z. Mao +7

Inference & Quantization Multimodal Models Robotics & Embodied AI

Kaijun Zhou +5 · 3w ago

Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

Edge NPUs can outperform flagship GPUs in cost and energy efficiency for on-robot VLA model deployment, but only with hardware-aware optimizations that tackle the models' distinct compute and memory-bound phases.

Kaijun Zhou, Qiwei Chen, Dajiang Peng +3

Inference & Quantization Multimodal Models Robotics & Embodied AI