Laplacian DP and adaptive quantization can slash federated learning communication costs by over 50% without sacrificing accuracy or privacy, even with non-IID data.

Emre Ardiç, Emre Ardıç, Yakup Genç

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Apr 27, 2026

Yuanhao Zeng +6 · Apr 27, 2026 · also ShanghaiTech University

Large Language Models Explore by Latent Distilling

Unlock more diverse and effective LLM outputs by explicitly rewarding semantic novelty during decoding with Exploratory Sampling.

Yuanhao Zeng, Ao Lu, Lufei Li +4

Inference & Quantization Natural Language Processing

Christian Lysenstoen · Apr 27, 2026

Feasible-First Exploration for Constrained ML Deployment Optimization in Crash-Prone Hierarchical Search Spaces

Standard black-box optimization falls apart when deploying ML models under tight constraints in crash-prone environments; TBA offers a robust, feasible-first alternative that actually works.

Christian Lysenstoen

Inference & Quantization Training Efficiency & Optimization

All Papers (100)

Apr 27, 2026

Alex Bienstock +7 · Apr 27, 2026

Scalable Secure Biometric Authentication without Auxiliary Identifiers

Finally, a practical biometric authentication system offers provable security against large-scale data breaches without sacrificing scalability or requiring auxiliary identifiers.

Alex Bienstock, Daniel Escudero, Antigoni Polychroniadou +5

Distributed Systems & Hardware Inference & Quantization

Zeyu Bai · Apr 27, 2026

Spark Policy Toolkit: Semantic Contracts and Scalable Execution for Policy Learning in Spark

Spark Policy Toolkit unlocks scalable policy learning in Spark by guaranteeing consistent results even with distributed execution, finally making it possible to apply complex policy learning techniques to large datasets.

Zeyu Bai

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Independent Researcher · Apr 27, 2026

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference

Squeeze your LLM inference costs: PolyKV slashes KV cache memory by up to 97% using a shared, compressed pool, with negligible impact on quality.

Ishan Patel, Ishan Joshi +1

Distributed Systems & Hardware Inference & Quantization

Minkyu Kim +7 · Apr 27, 2026 · also SNU, University, USC

Rethinking Layer Redundancy in Large Language Models: Calibration Objectives and Search for Depth Pruning

The secret to effectively pruning LLMs might not be *how* you search for redundant layers, but *what* you're optimizing for.

Minkyu Kim, Vincent-Daniel Yun, Youngrae Kim +5

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

Miao Lin +4 · Apr 27, 2026

Laplace-Bridged Randomized Smoothing for Fast Certified Robustness

Edge devices can now achieve up to 494x faster certified robustness with Laplace-Bridged Smoothing, making formally verified AI deployments practical in resource-constrained settings.

Miao Lin, MD Saifur Rahman Mazumder, Fengyi Yu +2

Inference & Quantization Red-Teaming & Adversarial Robustness Training Efficiency & Optimization

Apr 27, 2026

Compute Aligned Training: Optimizing for Test Time Inference

Training LLMs to explicitly optimize for how they're *actually* used at inference time unlocks substantial performance gains compared to standard fine-tuning.

Adam Ousherovitch, Ambuj Tewari

Inference & Quantization RLHF & Preference Learning Training Efficiency & Optimization

Sagnik Chatterjee +2 · Apr 27, 2026

Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs

Small language models can achieve reasoning performance rivaling larger models, even under tight token budgets, by using a lightweight "guidance track" to strategically prune and refine their chain-of-thought reasoning.

Sagnik Chatterjee, Atharva Patil, S. Ramesh

Inference & Quantization Natural Language Processing Reasoning & Chain-of-Thought

Ruhr University Bochum · Apr 27, 2026

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

Not all layers are created equal: pruning the KV cache in a layer-dependent manner significantly boosts long-context LLM performance compared to uniform pruning strategies.

Zahra Dehghanighobadi, Asja Fischer

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

William Oliveira · Apr 27, 2026

Less Is More: Engineering Challenges of On-Device Small Language Model Integration in a Mobile Application

On-device SLMs in mobile apps demand a radical shift: the less the LLM does, the more reliable it becomes.

William Oliveira

Inference & Quantization Natural Language Processing Open-Source Models & Weights

Iizalaarab Elhaimeur +3 · Apr 27, 2026

Latency and Cost of Multi-Agent Intelligent Tutoring at Scale

Multi-agent LLM systems can maintain sub-4-second response times even under classroom-scale concurrency, but only with the right throughput tier.

Iizalaarab Elhaimeur, Nikos Chrisochoides +1

Distributed Systems & Hardware Inference & Quantization Tool Use & Agents

Apr 27, 2026

Network Impact of Post-Quantum Certificate Chain sizes on Time to First Byte in TLS Deployments

Quantum-safe certificates bloat TLS handshakes so much that they measurably degrade web performance, and current CDN optimizations aren't enough to fully compensate.

Matthew Chou, Phuong M Cao

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Verdict Security · Apr 27, 2026 · also Ain Shams University

Machine-Checked Cardinality Bounds for Masked Barrett Reduction: A 1-Bit Side-Channel Leakage Barrier in Post-Quantum Cryptographic Hardware

Forget complex side-channel analysis: a single, machine-checked theorem proves that masked Barrett reduction leaks at most *one bit* of information per wire, offering a universal security guarantee for post-quantum crypto.

Ray Iskander, Khaled Kirah

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

School of Cyber Science and Technology · Apr 27, 2026

Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing

Backdoor attacks in LLMs can be defused at inference time, without retraining or external data, by geometrically smoothing attention patterns to disrupt adversarial routing.

Kaisheng Fan, Weizhe Zhang, Yishu Gao +2

Inference & Quantization Natural Language Processing Red-Teaming & Adversarial Robustness

Zihao Zheng +9 · Apr 27, 2026

FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching

Frequency domain analysis unlocks 1.59x speedups in Vision-Language-Navigation by enabling optimal token caching, a feat previously limited by visual domain approaches.

Zihao Zheng, Xingyu Zhou, Z. Mao +7

Inference & Quantization Multimodal Models Robotics & Embodied AI

Kaijun Zhou +5 · Apr 27, 2026

Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

Edge NPUs can outperform flagship GPUs in cost and energy efficiency for on-robot VLA model deployment, but only with hardware-aware optimizations that tackle the models' distinct compute and memory-bound phases.

Kaijun Zhou, Qiwei Chen, Dajiang Peng +3

Inference & Quantization Multimodal Models Robotics & Embodied AI

Anthony Faure-Gignoux +3 · Apr 27, 2026

Compilation and Execution of an Embeddable YOLO-NAS on the VTA

Compiling and executing YOLO-NAS on an FPGA-based accelerator is now possible, opening doors for real-time object detection in safety-critical applications like aeronautics.

Anthony Faure-Gignoux, Kevin Delmas, Adrien Gauffriau +1

Computer Vision Distributed Systems & Hardware Inference & Quantization

Wang Fan +7 · Apr 27, 2026

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

Forget A100s for long-context LLMs – Salca achieves up to 74x better energy efficiency with a sparsity-aware hardware accelerator.

Wang Fan, Wei Cao, Xionghui Zha +5

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

DAMO · Apr 27, 2026

TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

Vanilla on-policy distillation falls apart in multi-turn settings due to compounding errors, but a simple curriculum on trajectory length fixes it, even letting students beat their teachers.

Jiaqi Wang, Wenhao Zhang, Weijie Shi +2

Inference & Quantization Tool Use & Agents Training Efficiency & Optimization

Institut Polytechnique de Paris · Apr 27, 2026

S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

Shrinking massive audio foundation models by up to 61x is now possible without significant performance loss, thanks to a novel self-supervised distillation approach that works directly on embeddings.

Mohammed Ali El Adlouni, Aurian Quelennec, Pierre Chouteau +2

Inference & Quantization Speech & Audio Training Efficiency & Optimization

Apr 27, 2026 · also ICT CAS, USTC

TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

Squeezing intermediate tensors with FP8 quantization and adaptive transforms can nearly double the throughput of tensor-parallel LLM training without sacrificing accuracy.

Man Liu, Xingjian Tian, Bing Lu +6

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Apr 25, 2026

Emre Ardiç +2 · Apr 25, 2026 · also Gebze Technical University

Enhanced Privacy and Communication Efficiency in Non-IID Federated Learning With Adaptive Quantization and Differential Privacy

Laplacian DP and adaptive quantization can slash federated learning communication costs by over 50% without sacrificing accuracy or privacy, even with non-IID data.

Emre Ardiç, Emre Ardıç, Yakup Genç

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Apr 23, 2026

Ashley Abraham +4 · Apr 23, 2026

Large-Scale Data Parallelization of Product Quantization and Inverted Indexing Using Dask

Scale up your nearest neighbor search without blowing your budget: this work shows how to use Dask to parallelize Product Quantization and Inverted Indexing, achieving accuracy comparable to single-machine methods on much larger datasets.

Ashley Abraham, Andrew Strelzoff, Haley R. Dozier +2

Distributed Systems & Hardware Inference & Quantization Recommendation & Information Retrieval

Yixuan Zhu +7 · Apr 23, 2026

VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution

VARestorer distills a text-to-image VAR model into a one-step super-resolution network, achieving state-of-the-art image quality with a 10x speedup.

Yixuan Zhu, Haolin Wang, Ao Li +5

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

Wei Jiang · Apr 23, 2026

Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression

Forget compressing entire tokens – selectively routing *parts* of tokens based on query relevance unlocks better compression-quality tradeoffs in LoRA-adapted transformers.

Wei Jiang

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

Abbas Zeitoun +2 · Apr 23, 2026

Hyperloop Transformers

Halving the parameter count of LLMs without sacrificing performance is now possible with Hyperloop Transformers, thanks to looped layers and hyper-connected residual streams.

Abbas Zeitoun, Lucas Torroba-Hennigen, Yoon Kim

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

Boxun Xu +9 · Apr 23, 2026

Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

Autoregressive video diffusion models can achieve faster decoding, lower memory footprint, and higher quality long-horizon generations by learning to attend to only the most salient spatiotemporal blocks.

Boxun Xu, Yuming Du, Zichang Liu +7

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

Costin-Andrei Oncescu +5 · Apr 23, 2026

The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

Recurrent Transformers let you trade model depth for width, slashing KV cache memory footprint and inference latency without sacrificing performance.

Costin-Andrei Oncescu, Depen Morwani, Samy Jelassi +3

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

Anuj Sadani +1 · Apr 23, 2026

Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

LLM agents are wasting up to 60k tokens per turn on unnecessary tool schemas – Tool Attention slashes this "Tools Tax" by 95% and unlocks truly scalable agentic workflows.

Anuj Sadani, Deepak Kumar

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Tool Use & Agents

Dat To-Thanh +9 · Apr 23, 2026

Bridging the Training-Deployment Gap: Gated Encoding and Multi-Scale Refinement for Efficient Quantization-Aware Image Enhancement

Achieve high-fidelity image enhancement on mobile devices even after quantization by training a model that anticipates and adapts to low-precision representations.

Dat To-Thanh, N. Nguyen-Trong +7

Computer Vision Inference & Quantization Training Efficiency & Optimization

Corresponding author · Apr 23, 2026

GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion

Forget flat numerical compression – GS-Quant unlocks better knowledge graph completion by generating discrete codes that mirror the hierarchical nature of human reasoning.

Qizhuo Xie, Yunhui Liu, Yuecheng Xing +4

Inference & Quantization Natural Language Processing Reasoning & Chain-of-Thought

Apr 23, 2026

Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

LLMs can be both faster and smarter: pre-learned reasoning skills cut down token usage while boosting accuracy on coding and math problems.

Guangxiang Zhao, Qi Shi, Xusen Xiao +3

Inference & Quantization Reasoning & Chain-of-Thought Tool Use & Agents

K. Fojcik · Apr 23, 2026

Efficient Logic Gate Networks for Video Copy Detection

Achieve competitive video copy detection accuracy with descriptors orders of magnitude smaller and inference speeds exceeding 11k samples per second by replacing floating-point operations with a learned Boolean circuit.

K. Fojcik

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

N. Severin +10 · Apr 23, 2026 · also Sber AI Lab

Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation

Get LLM-boosted recommendations without the LLM latency: this distillation method lets you bake rich user profiles into efficient sequential recommenders.

N. Severin, Danil Kartushov, V. Urzhumov +8

Inference & Quantization Natural Language Processing Recommendation & Information Retrieval

Apr 23, 2026

Multilinguality at the Edge: Developing Language Models for the Global South

Deploying language models in the Global South requires bridging the gap between multilingual NLP and edge computing, two fields that have largely evolved independently despite their shared goals.

Lester James Validad Miranda, Songbo Hu, Roi Reichart +1

Distributed Systems & Hardware Inference & Quantization Natural Language Processing

Apr 23, 2026 · also Anhui Province Key Laboratory of Digital

When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors

LLM agent distillation leads to surprisingly high rates of behavioral mimicry, with some student models exhibiting tool-use habits *more* similar to their teachers than the teacher's own family members.

Chen Yang, Yuning Zhang, Zhoufutu Wen +4

Eval Frameworks & Benchmarks Inference & Quantization Tool Use & Agents

Jinrang Jia +2 · Apr 23, 2026 · also Corresponding author

You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes

Current 3D Gaussian Splatting methods are too unpredictable for real-world use, but YOGO makes them deterministic and production-ready.

Jinrang Jia, Zhenjia Li, Yifeng Shi

Computer Vision Inference & Quantization Training Efficiency & Optimization

Zhaohong Huang +4 · Apr 23, 2026

Prototype-Based Test-Time Adaptation of Vision-Language Models

Ditch the cache: Prototype-Based Test-Time Adaptation (PTA) boosts vision-language model accuracy by nearly 4% while *doubling* inference speed compared to existing cache-based methods.

Zhaohong Huang, Yuxin Zhang, Wenjing Liu +2

Computer Vision Inference & Quantization Multimodal Models

Jebacyril Arockiaraj +2 · Apr 23, 2026

ImageHD: Energy-Efficient On-Device Continual Learning of Visual Representations via Hyperdimensional Computing

Edge devices can now learn continuously from visual data with 40x faster speed and 380x better energy efficiency, thanks to a novel FPGA accelerator design.

Jebacyril Arockiaraj, Dhruv Parikh, Viktor K. Prasanna

Computer Vision Inference & Quantization Training Efficiency & Optimization

IIT · Apr 23, 2026 · also Edinburgh

Leveraging SIMD for Accelerating Large-number Arithmetic

SIMD parallelism can finally unlock substantial speedups in large-number arithmetic by rethinking algorithms around data-parallel operations, yielding up to 19.3% throughput gains in scientific computing.

Subhrajit Das, Abhishek Bichhawat, Yuvraj Patel

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Mohan Liyanage +3 · Apr 23, 2026

Risk-Aware and Stable Edge Server Selection Under Network Latency SLOs

Reduce deadline misses and server switching by explicitly accounting for tail risk and stability in edge server selection.

Mohan Liyanage, Arnova Abdullah, E. Zhantileuov +1

Distributed Systems & Hardware Inference & Quantization

Hongyao Liu +2 · Apr 23, 2026

An Efficient Wireless iBCI Headstage with Adaptive ADC Sample Rate

A server-driven adaptive sampling approach slashes power consumption in wireless iBCIs by 40mW while *improving* decoding accuracy.

Hongyao Liu, Junyi Wang, L. Zhai

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Hongyao Liu +3 · Apr 23, 2026

SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference

On-device LLM inference gets a massive speed and energy boost by adaptively streaming only the most expensive parts of the KV cache from the cloud.

Hongyao Liu, L. Zhai, Junyi Wang +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Mingqi Han +1 · Apr 23, 2026

A Task Decomposition and Planning Framework for Efficient LLM Inference in AI-Enabled WiFi-Offload Networks

Forget simple offloading – this framework intelligently decomposes LLM tasks across devices and edge servers, slashing latency and boosting rewards in congested WiFi networks.

Mingqi Han, Xing Sun

Distributed Systems & Hardware Inference & Quantization Tool Use & Agents

Yilong Chen +12 · Apr 23, 2026 · also CAS

Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling

By dynamically injecting frequency-aware n-gram features, X-GRAM achieves state-of-the-art accuracy with smaller embedding tables, offering a practical path to scaling memory-augmented architectures.

Yilong Chen, Yan Xie, Zitian Gao +10

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

Ankur Sharma +3 · Apr 23, 2026

Time, Causality, and Observability Failures in Distributed AI Inference Systems

Clock skew as small as 5ms can break causality in observability data from distributed AI inference systems, even when the system is working perfectly.

Ankur Sharma, Deep Shah, David Lariviere +1

Distributed Systems & Hardware Inference & Quantization

Apr 22, 2026

Decoding Text Spans for Efficient and Accurate Named-Entity Recognition

SpanDec achieves state-of-the-art NER accuracy with significantly improved throughput, proving that you don't need to exhaustively process every possible span to achieve top performance.

Andrea Maracani, Savas Ozkan, Junyi Zhu +2

Inference & Quantization Natural Language Processing

Apr 22, 2026

Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

Exact attention over billion-token sequences is now possible on a single GPU, thanks to a novel streaming approach that avoids out-of-memory errors without approximation.

Yiming Bian, Joshua M. Akey

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

An T. Le +1 · Apr 22, 2026

AAC: Admissible-by-Architecture Differentiable Landmark Compression for ALT

Differentiable landmark selection for shortest-path heuristics can provably preserve admissibility, achieving near-optimal coverage and faster query times compared to traditional methods.

An T. Le, Vien Ngo

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization

Samuel Salfati · Apr 22, 2026

Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

Forget pruning by variance: high-variance activations in transformers are surprisingly uncorrelated with predictive power.

Samuel Salfati

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Scaling Laws & Emergent Abilities

H. Pham +1 · Apr 22, 2026

Scalable AI Inference: Performance Analysis and Optimization of AI Model Serving

Optimizing AI inference can boost throughput and reduce latency, revealing strategies that enhance performance under real-world traffic conditions.

H. Pham, Fatih Gedikli

Distributed Systems & Hardware Inference & Quantization

Deevashwer Rathee +4 · Apr 22, 2026

Onyx: Cost-Efficient Disk-Oblivious ANN Search

Leaking user queries through disk access patterns in sensitive ANN search? Onyx flips the script on prior work to achieve up to 9.9x cost reduction and 12.3x latency improvement.

Deevashwer Rathee, Jean Watson, Zirui Neil Zhao +2

Distributed Systems & Hardware Inference & Quantization Recommendation & Information Retrieval

Apr 22, 2026

FSFM: A Biologically-Inspired Framework for Selective Forgetting of Agent Memory

Forgetting isn't a bug, it's a feature: selectively pruning LLM agent memories boosts efficiency by 8%, sharpens content quality by 29%, and eliminates security risks entirely.

Yingjie Gu, Bo Xiong, Yijuan Guo +6

Inference & Quantization Tool Use & Agents Training Efficiency & Optimization

N. M. Gil +5 · Apr 22, 2026

Adaptive Conformal Anomaly Detection with Time Series Foundation Models for Signal Monitoring

Get calibrated anomaly detection from time series foundation models without any fine-tuning, even when the data distribution shifts.

N. M. Gil, F. O'Donncha, Wesley M. Gifford +3

Inference & Quantization Natural Language Processing

A. Gupta +2 · Apr 22, 2026

On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks

Diffusion language models withstand aggressive quantization better than autoregressive models, suggesting a path to efficient deployment.

A. Gupta, Gururaj Deshpande, Chandreyi Chakraborty

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Inference & Quantization

Chenyuan Zhang +7 · Apr 22, 2026 · also Tsinghua AI, HIT, SJTU

Less Languages, Less Tokens: An Efficient Unified Logic Cross-lingual Chain-of-Thought Reasoning Framework

Reasoning across languages doesn't have to break the bank: a new framework slashes token costs by over 50% while maintaining accuracy, especially boosting performance in low-resource languages.

Chenyuan Zhang, Qiguang Chen, Xie Chen +5

Inference & Quantization Natural Language Processing Reasoning & Chain-of-Thought

Verdict Security · Apr 22, 2026 · also Ain Shams University

Fresh Masking Makes NTT Pipelines Composable: Machine-Checked Proofs for Arithmetic Masking in PQC Hardware

Machine-checked proofs now guarantee the security of arithmetic masking in NTT pipelines, but watch out: even a single lapse in "fresh masking" can expose vulnerabilities, as seen in the Adams Bridge accelerator.

Ray Iskander, Khaled Kirah

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Apr 22, 2026

DeepParse: Hybrid Log Parsing with LLM-Synthesized Regex Masks

LLMs can bootstrap accurate and efficient log parsing by synthesizing regex masks, enabling a hybrid approach that outperforms both heuristic and LLM-only methods.

Amir Shetaia, Sean Kauffman

Distributed Systems & Hardware Inference & Quantization Natural Language Processing

Yixiao Zeng +12 · Apr 22, 2026 · also NTU

X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference

Achieve 2.6x faster autoregressive world model inference without retraining by caching and selectively reusing block-level residuals across generation chunks.

Yixiao Zeng, Jianlei Zheng, Chaoda Zheng +10

Computer Vision Inference & Quantization World Models & Planning

Pham Phuong Nam Nguyen +3 · Apr 22, 2026

Efficient INT8 Single-Image Super-Resolution via Deployment-Aware Quantization and Teacher-Guided Training

Distilling knowledge from a Mamba-based teacher network significantly boosts the performance of quantized INT8 super-resolution models, enabling high-quality image enhancement on resource-constrained mobile devices.

Pham Phuong Nam Nguyen, Nam Le, Thi Kim Anh Vo +1

Computer Vision Inference & Quantization Training Efficiency & Optimization

Apr 22, 2026 · also KITECH School, Manufacturing AI Research Center

Weighted Knowledge Distillation for Semi-Supervised Segmentation of Maxillary Sinus in Panoramic X-ray Images

Achieve near-perfect (96.35% Dice) maxillary sinus segmentation from X-rays with limited labeled data by distilling knowledge from GAN-refined pseudo-labels.

Juha Park, Jiho Choi, J. Yun +4

Computer Vision Data Curation & Synthetic Data Inference & Quantization

Donghua University · Apr 22, 2026 · also Malanshan Audio & Video Laboratory, School of Electronic Information

Feedback-Driven Rate Control for Learned Video Compression

Achieving stable bitrate tracking in learned video compression can reduce average bitrate errors to as low as 2.13%, transforming how we manage video quality under constraints.

Zhiheng Xu, Xuerui Ma, Chunhua Peng

Computer Vision Inference & Quantization

Apr 22, 2026 · also CAS

FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving

Fine-grained management of speculative decoding phases can boost LLM serving throughput by over 50% and cut latency nearly in half.

Wenyan Chen, Chengzhi Lu, Yanying Lin +1

Distributed Systems & Hardware Inference & Quantization

Kyungmi Lee +5 · Apr 22, 2026 · also Microsoft Research

EnergAIzer: Fast and Accurate GPU Power Estimation Framework for AI Workloads

Forget hours-long simulations: EnergAIzer slashes GPU power estimation time to seconds while maintaining accuracy, by exploiting structured patterns in AI kernel optimizations.

Kyungmi Lee, Zhiye Song, Eun Kyung Lee +3

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Naser Khatti Dizabadi +1 · Apr 22, 2026

A Novel Low-Power Cache Architecture Based on 6-Transistor SRAM Cells

Stacking SRAM cells slashes leakage power without adding transistors.

Naser Khatti Dizabadi, Ceyda Elcin Kaya

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Apr 22, 2026 · also ETH, AI Center Tübingen, ELLIS, Tübingen +1

Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees

Deterministic decoding can outperform stochastic self-consistency in constrained domains by systematically exploring high-probability reasoning traces, leading to better performance with less computation.

Johannes Zenn, Guinan Su, Mrinmaya Sachan +1

Code Generation & Program Synthesis Inference & Quantization Reasoning & Chain-of-Thought

Peng Peng +4 · Apr 22, 2026

HaS: Accelerating RAG through Homology-Aware Speculative Retrieval

Speed up your RAG pipelines by up to 37% without sacrificing accuracy by speculatively retrieving documents based on query homology.

Peng Peng, Weiwei Lin, Wentai Wu +2

Inference & Quantization Recommendation & Information Retrieval

Apr 22, 2026 · also NAVER Labs, Samsung Electronics, UIUC

PVAC: A RowHammer Mitigation Architecture Exploiting Per-victim-row Counting

Flipping the script on RowHammer defense, PVAC counts activations on victim rows instead of aggressors, slashing false positives and boosting performance.

Jumin Kim, Seungmin Baek, Hwayong Nam +3

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Wenhong Zhu +3 · Apr 22, 2026

Hybrid Policy Distillation for LLMs

Hybrid Policy Distillation achieves superior performance by harmonizing the strengths of forward and reverse KL divergence, transforming the landscape of knowledge distillation for LLMs.

Wenhong Zhu, Ruobing Xie, Rui Wang +1

Inference & Quantization Natural Language Processing Training Efficiency & Optimization

Apr 21, 2026

Akash Yadav +2 · Apr 21, 2026

Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention

Get calibrated uncertainty estimates from your scientific foundation models in minutes, not days, with this simple attention randomization trick.

Akash Yadav, Taiwo A. Adebiyi, Ruda Zhang

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Scientific Discovery & Drug Design

Apr 21, 2026

SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference under Hard Uplink Budgets

Naive attention-based filtering for edge-cloud inference is suboptimal under tight bandwidth constraints; prioritizing semantic diversity in transmitted embeddings yields surprisingly large accuracy gains.

Inhyeok Choi, Hyuncheol Park

Distributed Systems & Hardware Inference & Quantization

ETH · Apr 21, 2026 · also Tsinghua AI, College of Computing and Data Science, NTU, UMich

Revisiting RaBitQ and TurboQuant: A Symmetric Comparison of Methods, Theory, and Experiments

TurboQuant's claimed advantages over RaBitQ in quantization don't hold up under rigorous, reproducible comparison, raising questions about its practical utility.

Jianyang Gao, Yutong Gou, Yuexuan Xu +5

Inference & Quantization Open-Source Models & Weights Training Efficiency & Optimization

University of Lübeck · Apr 21, 2026 · also University of Pisa

Scalable Memristive-Friendly Reservoir Computing for Time Series Classification

Compact, gradient-free MARS models can now outperform state-of-the-art gradient-based sequence models like Mamba, while slashing training times from hours to milliseconds.

Coşku Can Horuz, Andrea Ceni, Claudio Gallicchio

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Siqing Song +2 · Apr 21, 2026

LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation

LLMs can be aggressively quantized to W(1+1)A4 without significant performance degradation using a surprisingly simple three-stage distillation approach.

Siqing Song, Yong Lang, Xu-Yao Zhang

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

Jinda Jia +9 · Apr 21, 2026

SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

Forget fancy quantization schemes – a simple token-wise INT4 quantization with Hadamard rotation is all you need to nearly match FP16 accuracy in LLM serving, without sacrificing throughput.

Jinda Jia, Jisen Li, Zhongzhu Zhou +7

Distributed Systems & Hardware Inference & Quantization

Zhenbang Du +8 · Apr 21, 2026

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

Ditch the slow lane: $R^2$-dLLM turbocharges diffusion language models by slashing decoding steps by up to 75% without sacrificing quality.

Zhenbang Du, Kejing Xia, Xinrui Zhong +6

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Natural Language Processing

Linwei Dong +5 · Apr 21, 2026 · also ZJU

Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

Forget noisy samples, RL can now directly optimize the *gradients* of diffusion distillation, leading to SOTA few-step image generation.

Linwei Dong, Ruoyu Guo, Ge Bai +3

Inference & Quantization RLHF & Preference Learning Training Efficiency & Optimization

Weixiao Zhan +3 · Apr 21, 2026

Distillation Traps and Guards: A Calibration Knob for LLM Distillability

You can now dial a knob to make your LLM either super-distillable or completely un-distillable, opening up new possibilities for both efficient knowledge transfer and robust model protection.

Weixiao Zhan, Yongcheng Jing, Leszek Rutkowski +1

Inference & Quantization Natural Language Processing Training Efficiency & Optimization

Apr 21, 2026 · also Osaka, Shenzhen University

GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models

Achieve 50% parameter reduction in LLaMA-2-7B with minimal performance loss and no fine-tuning, thanks to a new global gating-based structured pruning method.

Ziyang Wang, Jiangfeng Xiao, Chuan Xiao +3

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

Yuli Chen +6 · Apr 21, 2026 · also SUSTech

SimDiff: Depth Pruning via Similarity and Difference

Similarity alone is a poor guide for LLM depth pruning: jointly considering representational similarity *and* transformation difference unlocks significantly better compression.

Yuli Chen, Shuhao Zhang, Fanshen Meng +4

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

Proximus Luxembourg S.A · Apr 21, 2026 · also Luxembourg

Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

Forget scaling laws: strategically equipping small language models with tools delivers a better performance/cost tradeoff than simply scaling up or deploying multi-agent systems.

Xinlin Wang, Mats Brorsson

Inference & Quantization Scaling Laws & Emergent Abilities Tool Use & Agents

Apr 21, 2026

ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

Achieve near-lossless performance in autonomous driving VLMs with 90% token reduction – without any training.

Linjie Sha, Haiyun Guo, Jinqiao Wang +1

Computer Vision Inference & Quantization Multimodal Models

Apr 21, 2026

Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration

Multi-modal models can now better handle distribution shifts thanks to a new method that explicitly models how different categories are distributed, even when the modalities are asymmetrical.

Jinglin Xu, Chuxiong Sun, Xiao Xu +2

Inference & Quantization Multimodal Models Training Efficiency & Optimization

Apr 21, 2026 · also BUET, Kyung Hee University, PolyU

DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

Attention's quadratic complexity is no longer a bottleneck: DASH-KV achieves linear O(N) inference without sacrificing accuracy by reformulating attention as an approximate nearest-neighbor search.

Yutong Li, Jiehui Xie, Md. Tamim Iqbal +5

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Stanford HAI · Apr 21, 2026 · also Macquarie

Are Large Language Models Economically Viable for Industry Deployment?

Forget chasing the biggest LLM – this benchmark reveals that smaller models (<2B params) can deliver 3x better energy efficiency and faster ROI in real-world industry deployments.

Abdullah Mohammad, Sushant Kumar Ray, Pushkar Arora +4

Distributed Systems & Hardware Eval Frameworks & Benchmarks Inference & Quantization

School of Informatics · Apr 21, 2026 · also Noah's Ark Lab

ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

Ditch the slow "think-first-then-translate" paradigm: ReflectMT internalizes reflection, delivering faster and better machine translation in a single pass.

Kunquan Li, Yingxue Zhang, Fandong Meng

Inference & Quantization Natural Language Processing Reasoning & Chain-of-Thought

School of Technology · Apr 21, 2026 · also Van Lang University

CHRONOS: A Hardware-Assisted Phase-Decoupled Framework for Secure Federated Learning in IoT

Federated learning can be sped up by 74% without sacrificing security, thanks to a novel hardware-assisted approach that cleverly decouples cryptographic setup from the active training phase.

Hung Dang

Distributed Systems & Hardware Inference & Quantization

College of Semiconductor Research · Apr 21, 2026 · also Department of Electrical Engineering, National Tsing Hua University

Silicon Aware Neural Networks

Neural networks made of logic gates can now be directly compiled to silicon, achieving impressive MNIST classification speeds with low power consumption.

Sebastian Fieldhouse, Kea-Tiong Tang

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Francesco Moretti +3 · Apr 21, 2026

Adaptive Slicing-Assisted Hyper Inference for Enhanced Small Object Detection in High-Resolution Imagery

Achieve state-of-the-art small object detection in high-resolution imagery while slashing inference time by 20-25% using adaptive slicing.

Francesco Moretti, Yi Jin, Guiqin Mario +1

Architecture Design (Transformers, SSMs, MoE) Computer Vision Inference & Quantization

Apr 21, 2026 · also Ajou University, Loyola University Chicago, UMN, USTC

Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge

The LPCVC 2025 winning solutions showcase surprisingly effective strategies for balancing accuracy and efficiency in edge-based computer vision, pushing the boundaries of what's possible on resource-constrained devices.

Zihao Ye, Yung Hsiang Lu, Xiao Hu +13

Computer Vision Eval Frameworks & Benchmarks Inference & Quantization

Joongho Jo +3 · Apr 21, 2026

AdaGScale: Viewpoint-Adaptive Gaussian Scaling in 3D Gaussian Splatting to Reduce Gaussian-Tile Pairs

Achieve over an order of magnitude speedup in 3D Gaussian Splatting by adaptively scaling Gaussians based on their color contribution, without sacrificing visual fidelity.

Joongho Jo, Hyerin Lim, Hanjun Choi +1

Computer Vision Inference & Quantization Training Efficiency & Optimization

Embedded Systems Lab · Apr 21, 2026

Energy Efficient LSTM Accelerators for Embedded FPGAs Through Parameterised Architecture Design

Achieve LSTM acceleration on embedded FPGAs with 11.89 GOP/s/W energy efficiency by tuning architectural parameters.

Chao Qian, Tianheng Ling, Gregor Schiele

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Wen Cheng +8 · Apr 21, 2026

Micro Language Models Enable Instant Responses

Instant AI assistants are now feasible on smartwatches: 8M-parameter models can kickstart responses locally, hiding cloud latency with surprisingly high quality.

Wen Cheng, Wen-Huang Cheng, Tuochao Chen +6

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Natural Language Processing

Qingyang Zhang +9 · Apr 21, 2026

TEMPO: Scaling Test-time Training for Large Reasoning Models

Test-time training can finally scale for large reasoning models: TEMPO unlocks sustained performance gains by interleaving policy refinement with periodic critic recalibration, boosting accuracy by over 18% on challenging benchmarks.

Qingyang Zhang, Xinke Kong, Haitao Wu +7

Inference & Quantization Reasoning & Chain-of-Thought Training Efficiency & Optimization

Apr 21, 2026

From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization

LLMs break in two fundamentally different ways when pushed to extreme quantization: either through gradual information loss or sudden functional breakdown of key components.

Chenxi Zhou, Pengfei Cao, Jianguo Li +2

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Training Efficiency & Optimization

Apr 21, 2026

Secure Storage and Privacy-Preserving Scanpath Comparison via Garbled Circuits in Eye Tracking

Unlock privacy-preserving eye-tracking analysis with garbled circuits, enabling secure scanpath comparison without revealing sensitive gaze data.

Suleyman Ozdel, Amr Nader, Amr A. Nader +2

Distributed Systems & Hardware Inference & Quantization

Yangming Zhang +4 · Apr 21, 2026

Gaussians on a Diet: High-Quality Memory-Bounded 3D Gaussian Splatting Training

Training 3D Gaussian Splatting models on edge devices is now practical: this method slashes peak memory consumption by 80% without sacrificing visual quality.

Yangming Zhang, Jian Xu, Kunxiong Zhu +2

Computer Vision Inference & Quantization Training Efficiency & Optimization

A. Zamani +1 · Apr 21, 2026

Optimizing Data Augmentation for Real-Time Small UAV Detection: A Lightweight Context-Aware Approach

Lightweight UAV detectors get a surprisingly large boost in accuracy and robustness from a carefully tuned Mosaic and HSV augmentation pipeline, outperforming more complex methods.

A. Zamani, Zeinab Abedini

Computer Vision Data Curation & Synthetic Data Inference & Quantization

Apr 21, 2026

Online CS-based SAR Edge-Mapping

Ditch bulky SAR image reconstruction: this online edge-mapping technique slashes memory and compute costs for UAV-based target recognition.

Conor Flynn, R.D. Ivanov, Birsen Yazici

Computer Vision Inference & Quantization