Search papers, labs, and topics across Lattice.
100 papers published across 6 labs.
Achieve high-precision multi-robot SLAM with minimal data transmission by selectively compressing and transmitting keyframes and non-keyframes in a cloud-edge-robot architecture.
By intelligently suppressing boundary outliers before quantization, BS-KMQ slashes quantization error by 3x and boosts energy efficiency by 24x in in-memory computing.
Exploit the surprisingly stable, yet heterogeneous, sparsity patterns across attention heads to slash LLM attention latency by 2.88x without sacrificing quality.
Forget slow FP64: this work unlocks efficient double-precision matrix multiplication on modern GPUs by adapting the Ozaki-II scheme to run on faster FP8 hardware.
Forget retraining from scratch: incremental federated learning can keep your IoT intrusion detection models sharp against evolving threats, but the right update strategy is crucial for balancing accuracy and speed.
AI electricity demand won't necessarily explode as AI scales: whether it does hinges on whether sustained efficiency improvements outpace income-driven demand growth.
Forget ZKPs: this federated learning scheme uses "self-destructing" backdoors to verify aggregation integrity, achieving 1000x speedups over traditional crypto.
Guarantee runtime safety in complex cyber-physical systems with unbounded data domains using a refinement type system for parameterized streams, even though the general verification problem is undecidable.
Training embodied intelligence models just got 40x faster thanks to a thousand-GPU cloud platform and a suite of optimizations spanning data pipelines, model architecture, and infrastructure.
Quantum-Centric Supercomputers promise to break down the barriers between quantum and classical computing, enabling seamless hybrid algorithms and accelerating discovery across applications.
SMEs can slash carbon emissions by 37% and costs by 3.6% simply by using Aceso's carbon-aware microservice placement, even with regionally limited infrastructure.
Maximize your LLM's goodput without diving into its internals: a new black-box controller uses hill climbing on end-to-end measurements to optimize performance.
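The black-box hill-climbing idea above can be sketched in a few lines: a controller that sees only end-to-end goodput measurements and nudges a single serving knob. The batch-size knob and the toy concave goodput curve below are illustrative stand-ins, not the paper's actual controller.

```python
import random

def measure_goodput(batch_size: int) -> float:
    """Stand-in for an end-to-end goodput measurement (requests/s).
    A real controller would drive live traffic through the black-box
    LLM server; this toy curve simply peaks near batch_size = 16."""
    return -(batch_size - 16) ** 2 + 256.0

def hill_climb(knob: int, lo: int, hi: int, steps: int = 50) -> int:
    """Greedy hill climbing on one integer knob using only external
    measurements -- no access to the model's internals."""
    best, best_score = knob, measure_goodput(knob)
    for _ in range(steps):
        candidate = max(lo, min(hi, best + random.choice([-2, -1, 1, 2])))
        score = measure_goodput(candidate)
        if score > best_score:  # keep only moves that improve goodput
            best, best_score = candidate, score
    return best
```

Because the loop never accepts a worsening move, the returned setting is always at least as good as the starting point under the measured objective.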
By adaptively weighting neighbor information based on uncertainty, distributed multi-object tracking can achieve significantly better performance in mobile robot networks with heterogeneous localization quality.
Multi-robot systems can slash battery consumption by 15% and boost GPU utilization by 50% for large DNN inference by using a hybrid offline-online reinforcement learning strategy to dynamically schedule and distribute DNN module execution.
Accuracy leaderboards mislead: lightweight classical anomaly detectors surprisingly outperform deep methods when deployed under the throughput constraints of in-vehicle monitoring systems.
Secure multi-tenant LLM serving without sacrificing performance is now possible: CacheSolidarity selectively isolates prefixes, boosting cache reuse by up to 70% and cutting inference latency by 30% compared to blunt-force defenses.
Quantifying the overhead of post-quantum cryptography reveals exactly where the performance bottlenecks lie in real-world TLS 1.3 transactions.
Algorithm-hardware co-design could revolutionize medical technology, but realizing its potential requires a fundamental shift in how these systems are conceived, designed, validated, and translated into practice.
Trajectory optimization just got a whole lot faster and more energy-efficient: a GPU-native solver achieves 4x speedup and halves energy consumption compared to optimized CPU baselines.
Stop neural network model theft: bind your models to specific hardware using PUFs, rendering them useless on clones.
Uncovers hidden architectural inefficiencies in serverless platforms by modeling function interactions as topological flows and identifying persistent "harmonic modes" that resist local fixes.
A pipelined FPGA architecture slashes the power consumption of JPEG XS's Intra Pattern Copy displacement vector search, enabling practical hardware deployment for low-latency image compression.
A fully open-source speech understanding model, OSUM-Pangu, proves that competitive performance is achievable on non-CUDA hardware, challenging the dominance of GPU-centric ecosystems.
CD-Raft slashes distributed consensus latency by nearly 50% in cross-domain settings, offering a significant speedup for data-intensive AI workloads.
Secure coded caching, crucial for modern content delivery, often treats security as an afterthought, resulting in fragmented solutions that this review seeks to unify and improve.
AgentServe achieves up to 2.8x improvement in time-to-first-token and 2.7x in time-per-output-token for agentic workloads on a single GPU by strategically isolating prefills and decodes.
Uncover hidden network structure and simplify management by automatically classifying hosts into meaningful roles based on their connection patterns.
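As a rough sketch of classifying hosts into roles from their connection patterns, the snippet below derives fan-in/fan-out features from flow records and applies simple thresholds. The thresholds and role names are illustrative assumptions, not the paper's method.

```python
from collections import defaultdict

def classify_hosts(flows):
    """Toy role classifier over (src, dst, dst_port) flow records.
    Hosts accepting connections from many distinct peers look like
    servers; hosts fanning out to many peers look like clients."""
    inbound = defaultdict(set)   # host -> set of peers connecting in
    outbound = defaultdict(set)  # host -> set of peers connected to
    for src, dst, _port in flows:
        outbound[src].add(dst)
        inbound[dst].add(src)
    roles = {}
    for host in set(inbound) | set(outbound):
        fan_in, fan_out = len(inbound[host]), len(outbound[host])
        if fan_in >= 3 and fan_in > fan_out:
            roles[host] = "server"
        elif fan_out >= 3:
            roles[host] = "client"
        else:
            roles[host] = "peer"
    return roles
```

A real system would cluster richer features (port diversity, timing, byte counts) rather than hard-code thresholds, but the structure is the same: connection patterns in, role labels out.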
On-device LLM inference can be sped up by an order of magnitude with a flexible TrustZone-based system that selectively protects memory and the NPU.
On-device LLM inference with PIM is now more practical: PIM-SHERPA resolves memory inconsistencies, slashing memory capacity needs by ~50% without sacrificing performance.
Ditch the latency tax of traditional scheduling: this new approach delivers data "just-in-time" for safety-critical systems, boosting performance without sacrificing reliability.
By strategically increasing hash collisions, Nemo slashes write amplification in flash caches for tiny objects, a persistent bottleneck even with advanced SSDs.
A virtualized XRootD frontend can sustain over 50 Gb/s throughput in real-world large-scale WAN transfers, challenging assumptions about virtualization overhead in high-performance data systems.
FP64 tensor cores, previously untapped for large-scale scientific computing, now unlock 2x speedups and 83% energy savings in finite element simulations on NVIDIA's latest GPUs.
Achieve fine-grained access control in searchable encryption without re-encryption or excessive interaction, enabling practical multi-client deployments in dynamic clouds.
Stream 3D Gaussian Splatting scenes with higher visual quality and lower bandwidth by predicting user viewpoints and dynamically adapting bitrate using deep reinforcement learning.
Distributing SciML models with hardware and physics awareness slashes latency and energy consumption by over 8x and 33x, respectively, while paradoxically *improving* reconstruction fidelity.
By incorporating language guidance into federated learning, SurgFed tackles the long-standing problem of tissue and task heterogeneity in surgical video understanding, leading to improved segmentation and depth estimation across diverse surgical settings.
Forget waiting hours: this MORL framework achieves 270x speedups on robotics tasks thanks to GPU-native parallelization.
Nezha shatters I/O bottlenecks in distributed key-value stores by decoupling key-value persistence within Raft, yielding up to 4.6x throughput gains.
By recombining subgraphs from sparse models without retraining, "model stitching" creates a diverse set of model variants that significantly improves the efficiency of multi-DNN inference on edge SoCs.
Finally, analog joint source-channel coding can be deployed on standard digital transceivers, unlocking the potential of semantic communication on existing infrastructure.
TMFGs can now scale to millions of data points thanks to a-TMFG, which approximates the correlation matrix on-the-fly using kNN graphs and clever memory management.
Get up to 24x faster sine/cosine calculations on ESP32 microcontrollers by dynamically switching between fixed-point and floating-point precision.
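A minimal sketch of the precision-switching idea: a Q15 quarter-wave lookup table serves as the fast fixed-point path, with a fallback to the exact floating-point routine when precision is needed. The table size and switching flag are illustrative assumptions; the paper's actual ESP32 kernels and switching policy are not reproduced here.

```python
import math

# Q15 quarter-wave sine table (illustrative fast path).
TABLE_SIZE = 256
SIN_TABLE = [round(math.sin(i * math.pi / 2 / TABLE_SIZE) * 32767)
             for i in range(TABLE_SIZE + 1)]

def fast_sin(x: float, need_precision: bool = False) -> float:
    """Return sin(x), switching between a fixed-point table lookup
    (fast, ~0.006 max error) and the float path (slow, exact)."""
    if need_precision:
        return math.sin(x)          # high-precision floating-point path
    x = math.fmod(x, 2 * math.pi)   # reduce to [0, 2*pi)
    if x < 0:
        x += 2 * math.pi
    quadrant, frac = divmod(x, math.pi / 2)
    idx = int(frac / (math.pi / 2) * TABLE_SIZE)
    s = SIN_TABLE[idx] / 32767.0              # quadrant 0
    if int(quadrant) == 1:                    # sin(x) = sin(pi - x)
        s = SIN_TABLE[TABLE_SIZE - idx] / 32767.0
    elif int(quadrant) == 2:                  # sin(x) = -sin(x - pi)
        s = -SIN_TABLE[idx] / 32767.0
    elif int(quadrant) == 3:                  # sin(x) = -sin(2*pi - x)
        s = -SIN_TABLE[TABLE_SIZE - idx] / 32767.0
    return s
```

On a microcontroller the fixed-point path would use integer arithmetic throughout and avoid the FPU entirely; Python is used here only to show the structure of the switch.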
Forget slow, iterative distributed signal estimation: dMWF achieves optimal multichannel Wiener filtering in wireless acoustic sensor networks without iteration, even when nodes observe different sources.
IoT devices struggling with weak entropy can now get a cryptographic boost from a RISC-V trusted execution environment, turning entropy provisioning into a manageable service.
Achieve higher accuracy and faster convergence in split learning by intelligently pruning communication channels based on label awareness.
Forget shaving yaks – this new protocol slashes communication costs in distributed expert learning while *improving* regret bounds.
Achieve up to two orders of magnitude reduction in semantic communication rate by strategically incorporating common randomness in a privacy-preserving distributed computation framework.
LLMs can get a 27.8% boost in mathematical reasoning by fusing a hardware-efficient optimal control layer directly into their architecture, enabling planning before prediction.
Latency's impact on VR whiteboard collaboration isn't uniform: it disproportionately degrades specific QoE dimensions, varying significantly between structured design and free-form discussion.
Traditional time-based authorization schemes are dangerously slow in multi-agent systems: a new coherence strategy slashes unauthorized API calls by over 100x, offering a velocity-agnostic safety guarantee.
Multi-prototype-guided federated learning overcomes data heterogeneity in edge computing, boosting accuracy and reducing errors compared to single-prototype methods.
Noise in photonic quantum systems severely limits the performance of quantum machine learning algorithms, demanding robust noise mitigation strategies for practical implementations.
On-device fine-tuning of Transformers is now feasible on ultra-low-power, memory-constrained edge devices thanks to TrainDeeploy, which processes up to 11 training images per second on a RISC-V SoC.
K-means, previously relegated to offline processing, gets a 17.9x speed boost on modern GPUs thanks to Flash-KMeans' clever IO and contention optimizations.
By ditching Python for optimized C++/CUDA kernels, ImprovedGS+ slashes 3D Gaussian Splatting training time by 26.8% while using 13.3% fewer Gaussians and maintaining superior visual quality.
Achieve near-perfect privacy against clustering and inversion attacks in split learning without sacrificing model accuracy by using differential privacy and secret label obfuscation.
Caching and speculative transcoding can drastically reduce the computational burden of on-the-fly point cloud transcoding, enabling scalable streaming systems.
Squeezing 11x more performance from your datacenter GPUs is now possible for compound inference tasks, thanks to JigsawServe's adaptive model selection and fine-grained spatial partitioning.
Lockbox offers a practical blueprint for enterprises to adopt cloud-based AI processing on sensitive data without compromising security, by implementing a zero-trust architecture.
Aerospace maintenance gets a trust upgrade: BladeChain uses blockchain to ensure tamper-proof, auditable AI-driven engine blade inspections.
Uncovers hidden architectural inefficiencies in serverless platforms by applying Hodge decomposition to analyze inter-function information flow.
Euclidean distance isn't the best way to measure gradient staleness in asynchronous federated learning: alternative distance metrics can significantly improve convergence and stability.
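To make the metric choice above concrete, here is a toy comparison of Euclidean and cosine distance inside a staleness-based weighting rule. The down-weighting form w = 1 / (1 + alpha * d) and the example vectors are illustrative assumptions, not the paper's scheme.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two gradient vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: sensitive to direction, not magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def staleness_weight(stale_grad, fresh_grad, metric, alpha=1.0):
    """Down-weight a stale client gradient by its distance from the
    current global gradient; the metric is the interchangeable knob."""
    return 1.0 / (1.0 + alpha * metric(stale_grad, fresh_grad))
```

For a stale gradient that still points the right way but has drifted in magnitude (say [10, 0] versus a fresh [1, 0]), cosine distance leaves its weight at 1.0 while Euclidean distance crushes it to 0.1: exactly the kind of behavioral difference the metric choice controls.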
Decentralized z-anonymity is now practical: deZent achieves comparable performance to centralized approaches while minimizing reliance on a trusted central entity.
Blockchain's consensus protocols face critical security, scalability, and energy consumption challenges that demand further research despite their pivotal role in decentralized systems.
Slash blockchain bloat by an order of magnitude: AR-ACE ships compact attestations, not bulky validity proofs, through mempool and relay networks.
Imagine an embedded OS where the scheduler, allocator, DMA drivers, and all peripherals are fully untrusted—this paper shows how to build it.
Slash blockchain transaction sizes by an order of magnitude with ZK-ACE, which replaces bulky post-quantum signatures with succinct, identity-based zero-knowledge proofs.
FedPrism dynamically adapts to non-IID data in federated learning by decomposing client models into global, group, and private components, outperforming traditional aggregation methods.
MoE models, despite their training efficiency, can be structurally 4.5x slower than quality-matched dense models at inference due to memory fragmentation, especially in long-context scenarios.
Democratized LLM pre-training is now a reality: Covenant-72B proves you can train a competitive 72B model with untrusted peers over the internet, opening the door to broader participation and reduced costs.
Get 3.6x faster long-context LLM inference with LycheeCluster's hierarchical KV indexing, which avoids the semantic fragmentation of naive chunking.
FPGAs aren't just for SmartNICs anymore: SafarDB shows they can directly accelerate distributed transactions with 7-12x speedups by tightly integrating with the network.
LLMs hallucinate far more than you think in document Q&A, with fabrication rates tripling as context grows from 32K to 128K tokens, and model selection matters more than hyperparameter tuning or hardware.
FedLECC slashes communication overhead in federated learning by 50% while boosting accuracy by 12%, all by cleverly picking clients based on data similarity and loss.
Overcome memory bottlenecks in drone-based Synthetic Aperture Radar (SAR) imaging with a new online reconstruction method that processes data incrementally.
SVD-powered aggregation in FedMomentum lets LoRA modules in federated learning retain crucial training momentum, leading to faster convergence and better performance.
Federated differentially private data synthesis can now achieve utility comparable to centralized approaches, even with heterogeneous data distributions, thanks to a novel framework that smartly handles noise and redundancy.
Unlock cloud-scale AI for enterprises without sacrificing data privacy: SplitAgent dynamically sanitizes sensitive data based on task context, boosting accuracy and privacy compared to static methods.
Tree speculative decoding can achieve up to 2.46x speedup on Ascend NPUs, but only if you carefully manage the branch/commit cache and eliminate undefined negative indices.
Achieve global-optimal GEMM mapping for spatial accelerators orders of magnitude faster than existing methods by analytically modeling the mapping space geometrically.
A Shapley-incentivized blockchain boosts federated learning accuracy by 14% and thwarts 90% of malicious attacks in high-speed rail data sharing.
Lattice dares to launch a cryptocurrency designed from the ground up to be post-quantum secure, ditching classical signature fallbacks entirely.
Beat the LLM inference bottleneck: SageSched's uncertainty-aware scheduling boosts efficiency by nearly 30% by predicting output length and balancing compute and memory demands.
Quantum advantage in chemistry may be further off than we thought: a new GPU-accelerated iQCC implementation simulates 100-200 qubit systems, outperforming classical methods on industrially relevant ruthenium catalysts.
Get near-peak performance for your recommender system across GPUs and TPUs without tedious platform-specific tuning, thanks to a new cross-accelerator graph optimization framework.
LLMs waste 21.8% of their context window on structural inefficiencies, but a demand paging system can slash context consumption by up to 93% without sacrificing performance.
Achieve 3% accuracy gains and 20% delay reduction in split federated learning simply by jointly optimizing model partitioning and client assignments.
Turn energy-intensive crypto mining into a data compression service with Proof-of-Encryption-Work (PoEW), a novel consensus mechanism.
Bridge the trust gap in cloud-based LLM services with AFTUNE, a practical framework that lets you audit proprietary fine-tuning and inference without prohibitive overhead.
Cloud autoscaling can be more than just reactive: MAS-H2 shows how a hierarchical multi-agent system can proactively optimize resource allocation based on high-level business policies, slashing CPU stress by 50% and enabling zero-downtime migrations.
Forget CPU bottlenecks: a fully GPU-resident architecture verifies Goldbach's conjecture up to $10^{12}$ in under 40 seconds on a single RTX 5090.
By intelligently leveraging application data characteristics and machine learning, microarchitectural designs can overcome memory bottlenecks and achieve substantial performance and energy efficiency gains.
Automating multi-service deployments in edge-cloud environments doesn't have to be a headache: CODECO slashes manual effort while keeping performance competitive.
Slash overhead and boost resilience in massive dynamic networks with Structured Gossip DNS, a passively stabilizing system that cuts message complexity by an order of magnitude.
Today's high-performance interconnects are built on shaky semantic ground, potentially sacrificing concurrency for reliability through hidden serialization.
Diffusion models can now run with 3x better energy efficiency and 5.5x higher throughput thanks to a silicon photonics accelerator.
Training trillion-parameter Mixture-of-Experts models just got a whole lot faster: Megatron Core now achieves over 1 PFLOP/s per GPU on NVIDIA's latest hardware.
Squeeze 46% more LLM inference throughput from your many-core CPUs with ArcLight, a new architecture that overcomes the cross-NUMA memory access bottleneck.
MEV has evolved from simple miner extraction to a complex cross-chain phenomenon, and this SoK provides a unified framework to understand its past, present, and future.