April 24 – May 1, 2026

Distributed Systems & Hardware - Weekly Roundup

100 papers published across 5 labs.

Top Papers

Apr 30, 2026

Barcelona Supercomputing Center·3w ago·also Czestochowa University of Technology, Universitat Jaume I

A study on the performance of distributed training of data-driven CFD simulations

Distributed GPU training slashes the time needed to train deep learning models for CFD, making accurate fluid simulation predictions accessible in a fraction of the time.

Sergio Iserte, Alejandro González-Barberá +25

Distributed Systems & Hardware Scientific Discovery & Drug Design Training Efficiency & Optimization

Barcelona Supercomputing Center·3w ago·also Czestochowa University of Technology, Universitat Jaume I

Towards the Democratization and Standardization of Dynamic Resources with MPI Spawning

Unlock HPC application malleability without the headache of process respawning thanks to this unified dynamic resource management API.

Sergio Iserte, Iker Martín-Álvarez +75

Distributed Systems & Hardware

Apr 28, 2026

Emre Ardıç +2·3w ago·also Gebze Technical University

Sample selection using multi-task autoencoders in federated learning with non-IID data

Federated learning accuracy jumps by up to 7% simply by using a multi-task autoencoder to identify and filter out noisy or uninformative samples on each client.

Emre Ardıç, Emre Ardiç, Yakup Genç

Computer Vision Distributed Systems & Hardware Training Efficiency & Optimization

Andrew E. M. Lewis-Pye +2·3w ago·also London School of Economics

Volitional Multiagent Atomic Transactions: Describing People and their Machines

Finally, a formal model that treats humans as more than just external noise in distributed systems, opening the door to verifiable grassroots platforms.

Andrew E. M. Lewis-Pye, Andy Lewis-Pye, Ehud Shapiro

Distributed Systems & Hardware Tool Use & Agents

May 1, 2026

Zi-Bo Qin +4·3w ago

Learning to Act and Cooperate for Distributed Black-Box Consensus Optimization

LLMs can now intelligently orchestrate multi-agent systems, learning to optimize both individual agent actions and inter-agent cooperation for distributed black-box problems.

Zi-Bo Qin, Zijian Qin, Feng-Feng Wei +2

Distributed Systems & Hardware Robotics & Embodied AI Tool Use & Agents

All Papers (100)

Apr 30, 2026

Shun Takagi +1·3w ago

Shuffling-Aware Optimization for Private Vector Mean Estimation

Shuffling data introduces a fundamental shift in the privacy-utility tradeoff for mean estimation, rendering locally differentially private (LDP) mechanisms suboptimal.

Shun Takagi, Seng Pei Liew

Distributed Systems & Hardware Training Efficiency & Optimization

3w ago·also BU, Cornell, NTT Physics and Informatics Laboratories

Physical Foundation Models: Fixed hardware implementations of large-scale neural networks

Forget chasing bigger GPUs – the future of AI inference could be literally baked into the hardware itself, unlocking 1000x gains in energy and speed.

Logan G. Wright, Tianyu Wang, Tatsuhiro Onodera +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

3w ago·also RUC

Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning

By intelligently perturbing class prototypes based on their discriminative power, VPDR achieves a superior privacy-utility trade-off in federated learning compared to naive Gaussian noise.

Yuhua Wang, Qinnan Zhang, Xiaodong Li +6

Data Curation & Synthetic Data Distributed Systems & Hardware Training Efficiency & Optimization

3w ago

FMCL: Class-Aware Client Clustering with Foundation Model Representations for Heterogeneous Federated Learning

Foundation model embeddings reveal hidden structure in federated datasets, enabling surprisingly effective client clustering without any training or communication overhead.

Mahad Ali, Laura J. Brattain

Data Curation & Synthetic Data Distributed Systems & Hardware Training Efficiency & Optimization

Sivaram Krishnan +6·3w ago

Toward Scalable SDN for LEO Mega-Constellations: A Graph Learning Approach

Managing thousands of LEO satellites just got easier: a novel graph learning approach slashes network management overhead while boosting forecasting accuracy.

Sivaram Krishnan, S. Krishnan, Bassel Al Homssi +4

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Xubin Luo +1·3w ago

AI Inference as Relocatable Electricity Demand: A Latency-Constrained Energy-Geography Framework

Unlocking the energy-latency frontier reveals how much cheaper and greener AI inference could be if we strategically relocate computation based on latency tolerance.

Xubin Luo, Yang Cheng

Distributed Systems & Hardware Inference & Quantization

Mohd Sameen Chishti +2·3w ago

Feature-Centric Methodology for Analyzing Cross-Chain NFT Migration Compatibility

Stop costly cross-chain NFT migrations before they start: a new feature-centric methodology predicts which NFT functionalities will break when moving between blockchains like Ethereum and Solana.

Mohd Sameen Chishti, Damilare Peter Oyinloye, Jingyue Li

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

3w ago·also Dolby Labs

ReVo: A Cross-Layer Reliable Volumetric Videoconferencing System

Volumetric videoconferencing doesn't have to freeze and stutter: ReVo recovers up to 32% of lost RGB data and slashes video freezes by 95% using a cross-layer approach.

Ankur Aditya, Diptyaroop Maji, Lingdong Wang +6

Computer Vision Distributed Systems & Hardware

3w ago·also RUC

Akita: A High Usability Simulation Framework for Computer Architecture

Frustrated with clunky architecture simulators? Akita offers a breath of fresh air with its focus on developer experience, promising faster prototyping and experimentation.

Sabila Al Jannat, Ying Li +12

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Muhammad Ihsan Al Hafiz +3·3w ago

NeuroRing: Scaling Spiking Neural Networks via Multi-FPGA Bidirectional Ring Topologies and Stream-Dataflow Architectures

NeuroRing achieves faster-than-real-time execution of a full-scale cortical microcircuit simulation on FPGAs, proving that scalable, energy-efficient SNN hardware is within reach.

Muhammad Ihsan Al Hafiz, Muhammad Ihsan Al Hafiz, Artur Podobas +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

3w ago·also ANL

Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3

Cerebras CS-3 can deliver 100x speedups over CPU for sparse matrix multiplication at 90% sparsity, but surprisingly, becomes *slower* than CPU beyond 99% sparsity.

Milan Shah, Sheng Di, Michela Becchi

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

Tijn de Vos +4·3w ago

Distributed Santa Claus via Global Rounding

Even approximately fair gift-giving is surprisingly hard in distributed systems: achieving any approximation for the Santa Claus problem requires $\Omega(\sqrt{n} + D)$ rounds.

Tijn de Vos, Leo Wennmann, Malte Baumecker +2

Distributed Systems & Hardware

MEV-X·3w ago·also HSE University, Moscow Institute of Physics and Technology, Skoltech

The Origins of MEV: Systematic Attribution of Arbitrage Opportunity Creation at Scale

Most MEV arbitrage opportunities on Polygon can be traced back to a single transaction, revealing surprising concentration in MEV creation across protocols.

Andrei Seoev, Dmitry Belousov, Anastasiia Smirnova +6

Distributed Systems & Hardware

Jin Xin Ng +10·3w ago

Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale

Schedulers can boost throughput by 12% on chiplet-based systems simply by treating spatial locality as a first-class objective, even if it means sacrificing work-conservation.

Jin Xin Ng, Ori Livneh, Richard O'Grady +8

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Behnaz Ranjbar +1·3w ago

AnTi-MiCS: Analytical Framework for Bounding Time in Embedded Mixed-Criticality Systems

Balancing processor utilization and Quality-of-Service in mixed-criticality systems just got easier with AnTi-MiCS and MulTi-MiCS, which automatically determine optimal low WCETs and improve QoS by up to 30%.

Behnaz Ranjbar, Akash Kumar

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

3w ago·also Beijing Academy of Blockchain and Edge, Fudan, Shanghai Academy of Future Internet

Back to the Future: Rethinking Endorsement in Order-Execute Blockchains

Order-execute blockchains can achieve 10x higher throughput in DeFi workloads by embedding flexible endorsement directly into the consensus mechanism, avoiding the high abort rates of execute-order-validate approaches.

Rongji Huang, Yifeng Ye, Gerui Wang +5

Distributed Systems & Hardware

3w ago·also Samsung Electronics

AME-PIM: Can Memory be Your Next Tensor Accelerator?

HBM-PIM can achieve impressive matrix multiplication throughput (14.9 GFLOP/s) using a novel reduction-free outer-product dataflow, even without native reduction support.

Emanuele Venieri, Simone Manoni, Alberto Florian +3

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Jisheng Zhao +4·3w ago

CuLifter: Lifting GPU Binaries to Typed IR

Recovering type information from untyped GPU register files is the key to enabling effective binary analysis, unlocking reverse engineering and security analysis of proprietary GPU code.

Jisheng Zhao, Huanzhi Pu, Shinnung Jeong +2

Code Generation & Program Synthesis Distributed Systems & Hardware Inference & Quantization

Yan-Cheng Guo +2·3w ago

RCW-CIM: A Digital CIM-based LLM Accelerator with Read-Compute/Write

Forget waiting – this new CIM architecture slashes LLM weight update latency by up to 87%, unlocking faster prefill and decoding.

Yan-Cheng Guo, Tian-Sheuan Chang, Jian-Wei Su

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Zi-Wei Lin +2·3w ago

VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling

Ternary LLMs can achieve impressive throughput and energy efficiency on edge devices, thanks to VitaLLM's co-designed hardware acceleration that overcomes workload imbalance and data dependency challenges.

Zi-Wei Lin, Zimiao Lin, Tian-Sheuan Chang

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

3w ago

WOOTdroid: Whole-system Online On-device Tracing for Android

Android's security-relevant IPC is now traceable on stock devices without app instrumentation, closing a critical visibility gap for security researchers and incident responders.

Simon Althaus, Nikolaos Alexopoulos +7

Distributed Systems & Hardware

3w ago·also INRIA

Strait: Perceiving Priority and Interference in ML Inference Serving

Juggling high-priority and low-priority ML inference requests on GPUs? Strait delivers up to 11% fewer missed deadlines for critical tasks.

Haidong Zhao, Nikolaos Georgantas

Distributed Systems & Hardware Inference & Quantization

3w ago·also Tsinghua AI

FedHarmony: Harmonizing Heterogeneous Label Correlations in Federated Multi-Label Learning

Federated learning can overcome data silos, but struggles when clients have different label relationships; FedHarmony shows how to harmonize these differences, leading to better performance.

Zhiqiang Kou, Zhi Kou, Jun Wu +11

Data Curation & Synthetic Data Distributed Systems & Hardware Natural Language Processing

Zhenzhou Jin +3·3w ago

Statistical Channel Fingerprint Construction for Massive MIMO: A Unified Tensor Learning Framework

Ditch the encoder-decoder: LPWTNet's closed-form Laplacian pyramid decomposition offers efficient inference for statistical channel fingerprint construction in massive MIMO systems.

Zhenzhou Jin, Li You, Xiang-Gen Xia +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

Zehui Tang +3·3w ago·also MIIT Key Laboratory of Pattern Analysis, NJU

AdaBFL: Multi-Layer Defensive Adaptive Aggregation for Byzantine-Robust Federated Learning

Adaptively weighting defenses in federated learning lets you robustly handle diverse attacks without needing the dataset on the server.

Zehui Tang, Yuchen Liu, F. Huang +1

Distributed Systems & Hardware Red-Teaming & Adversarial Robustness

3w ago

Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes

Achieve 100% agent recovery correctness with near-zero overhead by intelligently checkpointing only the OS state that actually matters.

Tianyuan Wu, Chaokun Chang +4

Distributed Systems & Hardware Tool Use & Agents

Wenxiang Lin +5·3w ago·also HIT

ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training

LLM training bottlenecks? ZipCCL achieves up to 1.18x end-to-end speedups by losslessly compressing communication collectives, without sacrificing model quality.

Wenxiang Lin, Xinglin Pan, Ruibo Fan +3

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

3w ago

ScaleBox: Enabling High-Fidelity and Scalable Code Verification for Large Language Models

LLMs trained with ScaleBox, a new high-fidelity code verification system, substantially outperform those trained with heuristic matching, suggesting current RLHF methods are bottlenecked by verification quality.

Jiasheng Zheng, Xin Zheng, Boxi Cao +9

Code Generation & Program Synthesis Distributed Systems & Hardware Eval Frameworks & Benchmarks

Apr 29, 2026

Akshay Karjol +1·3w ago

Real-Time GPU-Accelerated Monte Carlo Evaluation of Safety-Critical AEB Systems Under Uncertainty

Real-time, GPU-accelerated Monte Carlo simulation makes probabilistic safety guarantees for Automatic Emergency Braking systems deployable, not just a validation afterthought.

Akshay Karjol, Shadi Alawneh

Computer Vision Distributed Systems & Hardware Robotics & Embodied AI

Ahan Gupta +4·3w ago·also Snowflake

AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism

Training LLMs on ultra-long contexts just got a whole lot easier: AutoSP automates sequence parallelism and activation checkpointing, boosting context length by up to 2.7x with negligible throughput cost.

Ahan Gupta, Zhihao Wang, Neel Dani +2

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

3w ago

Adaptive Self-Organization in Anonymous Dynamic Networks

Even with adversarial network changes and only local signals, surprisingly simple distributed algorithms can enable dynamic networks to self-organize and adapt to changing environmental goals.

Garrett Parzych, Joshua J. Daymude

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

3w ago

FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving

Slash MoE serving costs by two-thirds with FaaSMoE, a serverless architecture that dynamically scales experts on demand.

Minghe Wang, Trever Schirmer, Mohammadreza Malekabbasi +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Barcelona Supercomputing Center·3w ago·also UPC

A Semantic Quantum Circuit Cache for Scalable and Distributed Quantum-Classical Workflows

Stop recomputing the same quantum circuits: a semantic cache slashes redundant simulations by up to 92% and speeds up real quantum hardware by 11x.

Mar Tejedor, Javier Conejero, Rosa M. Badia

Distributed Systems & Hardware Inference & Quantization

University of Artificial Intelligence·3w ago

COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training

Fixing your parallelism strategy while tuning batch size (or vice versa) leaves performance on the table: COPUS adaptively co-tunes both for faster LLM training.

Akhmed Sakip, Erland Hilman Fuadi, Omar Sayedelahl +6

Distributed Systems & Hardware Training Efficiency & Optimization

Tianhao Hu +16·3w ago

DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

Asynchronous RL for LLMs doesn't have to sacrifice convergence for speed: DORA achieves 2-4x faster training by cleverly managing multiple policy versions during rollout.

Tianhao Hu, Xiangcheng Liu, Youshao Xiao +14

Distributed Systems & Hardware RLHF & Preference Learning Training Efficiency & Optimization

Timothy Flavin +1·3w ago

A High-Throughput Compute-Efficient POMDP Hide-And-Seek-Engine (HASE) for Multi-Agent Operations

Training complex multi-agent RL policies just got 3,500x faster thanks to a new engine that optimizes for memory access and data locality.

Timothy Flavin, Sandip Sen

Distributed Systems & Hardware Robotics & Embodied AI Training Efficiency & Optimization

Hyunsung Yoon +3·3w ago

Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators

Dense matrix multiplication accelerators can surprisingly outperform dedicated sparse accelerators for sparse neural networks, offering better area and energy efficiency.

Hyunsung Yoon, Sungju Ryu, Sungju Ryu +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Barcelona Supercomputing Center (BSC)·3w ago

Verification and Validation (V&V)-in-the-Loop for RISC-V Design: The Holistic Vision of BZL

A holistic, industrial-grade V&V loop promises to accelerate and de-risk RISC-V chip design by integrating RTL validation, FPGA-based system-level testing, and continuous integration.

Sajjad Ahmed, Alexander Kropotov, Roberto Ignacio Genovese +21

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Alexander Kropotov +2·3w ago

EMiX: Emulating Beyond Single-FPGA Limits

Emulating massive multi-core systems just got easier: EMiX lets you scale RISC-V emulation across multiple FPGAs without rewriting your RTL.

Alexander Kropotov, Miquel Moreto, Behzad Salami

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

3w ago

LLM-Guided Runtime Parameter Optimization for Energy-Efficient Model Inference

Forget grid search: LLMs can rapidly find energy-efficient inference parameters, outperforming traditional optimization methods with just a few human-guided prompts.

Katelyn Crumpacker, Dimitrios Nikolopoulos

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Ericsson AB·3w ago·also KTH

Where did we fail? -- Reproducing build failures in embedded open source software

Replaying CI failures in embedded systems is now possible at scale: PhantomRun reconstructs over 90% of failing builds, opening the door to systematic debugging and failure analysis.

Han Fu, Andreas Ermedahl, Sigrid Eldh +3

Code Generation & Program Synthesis Distributed Systems & Hardware Open-Source Models & Weights

Verint Systems Inc·3w ago

When Your LLM Reaches End-of-Life: A Framework for Confident Model Migration in Production Systems

Stop sweating LLM migrations: this Bayesian framework lets you confidently swap models in production, even with limited human evals.

Emma Casey, David Roberts, David Sim +1

Distributed Systems & Hardware Eval Frameworks & Benchmarks Inference & Quantization

Nankai University·3w ago·also Tsinghua AI, NYU

Which Types of Heterogeneity Matter for Root Cause Localization in Microservice Systems?

Ignoring the nuanced interplay between services and hosts in microservice architectures leaves nearly 50% of root causes undiscovered.

Runzhou Wang, Shenglin Zhang, Wenwei Gu +5

Distributed Systems & Hardware

3w ago

End-to-End and Phase-Level Performance Optimization for Hyperledger Fabric

Overlapping validation and private-data acquisition of successive blocks with state-consistency checks and ledger updates can almost double Hyperledger Fabric's commit throughput.

Pavan Sollu, Aniruddha Mukherjee, Divya Pulivarthi +6

Distributed Systems & Hardware Training Efficiency & Optimization

Tsinghua AI·3w ago

Efficient Training on Multiple Consumer GPUs with RoundPipe

Fine-tune massive LLMs like Qwen3-235B with 31K context on a single 8x RTX 4090 server, thanks to a novel pipeline schedule that eliminates the weight binding bottleneck.

Yibin Luo, Yi Luo, Shiwei Gao +3

Distributed Systems & Hardware Training Efficiency & Optimization

Barcelona Supercomputing Center (BSC)·3w ago

A Test Taxonomy and Continuous Integration Ecosystem for Dynamic Resource Management in HPC

Automated testing of dynamic resource management frameworks in HPC is now possible, catching faults earlier and simplifying maintenance.

Petter Sandås, Íñigo Aréjula-Aísa

Code Generation & Program Synthesis Distributed Systems & Hardware Eval Frameworks & Benchmarks

NVIDIA·3w ago

Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding

Speculative decoding, typically used post-RL, can be integrated directly into RL training loops to accelerate LLM rollout generation by up to 2.5x.

Hayate Iso, Tiyasa Mitra, Sudipta Mondal +22

Distributed Systems & Hardware Inference & Quantization RLHF & Preference Learning +1

Yiqi Liu +4·3w ago

Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel

Forget brute-force scaling: smarter tile and tensor mapping on 3D-stacked chips could unlock massive LLM inference gains.

Yiqi Liu, Noelle Crawford, Michael Wang +2

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Rafael Mayo +1·3w ago

DMRlib: Easy-coding and Efficient Resource Management for Job Malleability

Unlock 3x higher throughput in your data center by easily converting MPI applications to malleable jobs with a new library.

Rafael Mayo, Enrique S. Quintana-Ortí

Code Generation & Program Synthesis Distributed Systems & Hardware Training Efficiency & Optimization

Bodon Jeong +8·3w ago

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

Edge LLM inference gets a serious speed boost: DUAL-BLADE's dual-path KV cache slashes latency by up to 42% and doubles SSD utilization.

Bodon Jeong, Bodon Jeong, H.I. Byun +6

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

ETH·3w ago·also ANU, Sydney

FloatSOM: GPU-Accelerated, Distributed, Topology-Flexible Self-Organizing Maps

Training a 1024-node SOM on a billion-sample dataset in just over 6 minutes shatters previous scalability limits, thanks to a novel framework that leverages multi-GPU execution, out-of-memory streaming, and flexible topologies.

Tony Xu, Sarah Klamt, Katherine Turner +3

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

3w ago·also Toulouse INP

MPI Malleability Validation under Replayed Real-World HPC Conditions

MPI malleability can cut HPC workload times by over 25% in real-world conditions, but only if you account for parallel efficiency.

S. Iserte, M. Madon, G. Da +2

Distributed Systems & Hardware

Anna Golubeva +1·3w ago

Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training & Inference

Squeeze more out of your hardware: TSP lets you shard both weights and activations across the same devices, unlocking memory savings for long-context training and inference.

Anna Golubeva, Quentin Anthony

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

Yimeng Shan +8·3w ago·also ANL, NJUST

SplitFT: An Adaptive Federated Split Learning System For LLMs Fine-Tuning

Fine-tuning LLMs in federated settings just got easier: SplitFT lets clients adapt their cut layers and LoRA ranks, boosting performance and slashing communication costs.

Yimeng Shan, Yimeng Shan, Zhaorui Zhang +6

Distributed Systems & Hardware Natural Language Processing Training Efficiency & Optimization

Aditya Ukarande +7·3w ago

Efficient, VRAM-Constrained xLM Inference on Clients

Squeezing high-accuracy LLMs and VLMs onto client devices is now significantly more feasible, thanks to a new pipelined sharding technique that achieves up to 30x speedups and 10x VRAM reduction.

Aditya Ukarande, Aditya Ukarande, Deep Shekhar +5

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization +1

3w ago

Revealing NVIDIA Closed-Source Driver Command Streams for CPU-GPU Runtime Behavior Insight

NVIDIA's closed-source driver secrets are out: researchers can now see the exact hardware commands triggered by CUDA code.

Yuang Yan, Ian Karlin, Ryan Grant

Distributed Systems & Hardware Open-Source Models & Weights

3w ago·also Georgia Tech

Recent Advances in mm-Wave and Sub-THz/THz Oscillators for FutureG Technologies

Tomorrow's 6G networks hinge on overcoming the design hurdles of mm-wave and sub-THz oscillators, and this review lays out the roadmap.

Baktash Behmanesh, Ahmad Rezvanitabar

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

3w ago·also Vrije Universiteit Amsterdam

What Is the Cost of Energy Monitoring? An Empirical Study on the Overhead of RAPL-Based Tools

Naive RAPL-based energy monitoring can add nearly 50% overhead to your measurements, but optimized tools can keep it negligible.

Jeremy Diamond, Vincenzo Stoico

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Central Research Laboratory·3w ago

Quantum Gatekeeper: Multi-Factor Context-Bound Image Steganography with VQC Based Key Derivation on Quantum Hardware

Quantum Gatekeeper achieves near-perfect information hiding: without all four factors (password, shared secret, context string, and reference image signature), payload extraction fails silently, preventing even partial disclosure.

Sahil Tomar

Computer Vision Distributed Systems & Hardware

Pericle Perazzo +1·3w ago

Catching the Fly: Practical Challenges in Making Blockchain FlyClient Real

FlyClient, a lightweight blockchain verification protocol, gets closer to real-world deployment with a practical Zcash implementation and proof-size optimizations.

Pericle Perazzo, Dario Capecchi

Distributed Systems & Hardware

University of the Cumberlands·3w ago

Agent Name Service (ANS): A Proof-of-Concept Trust Layer for Secure AI Agent Discovery, Identity, and Governance in Kubernetes

Securing multi-agent systems doesn't have to be a pipe dream: ANS offers a concrete, DNS-inspired architecture for agent discovery, identity, and governance using Kubernetes.

Akshay Mittal, Elyson De La Cruz

Constitutional AI & AI Ethics Distributed Systems & Hardware Tool Use & Agents

3w ago

Towards Intelligent Computation Offloading in Dynamic Vehicular Networks: A Scalable Multilayer Pipeline

A modified Particle Swarm Optimization algorithm slashes computation offloading latency in vehicular networks, outperforming brute-force methods in dynamic, real-world scenarios.

Falk Dettinger, Matthias Weiß, Baran Can Gül +3

Computer Vision Distributed Systems & Hardware Robotics & Embodied AI

3w ago·also Illinois Institute of Technology

StreamGuard: Exploring a 5G Architecture for Efficient, Quality of Experience-Aware Video Conferencing

Squeezing more out of 5G video calls is possible: StreamGuard boosts video conferencing quality by up to 70% by intelligently prioritizing different parts of the video stream.

Xuyang Cao, Oliver Michel, Kyle Jamieson

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Apr 28, 2026

Viplove Goswami·3w ago

Governing APIs at Scale: An Enterprise Framework Using Google Apigee in Multi-Cloud Environments

Slash configuration drift by 42% and boost API propagation by 31% with this framework for governing APIs across AWS, Azure, and GCP.

Viplove Goswami

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Shuchen Zhu +3·3w ago

Subspace Optimization for Efficient Federated Learning under Heterogeneous Data

Federated learning can achieve better accuracy-efficiency trade-offs under heterogeneous data by optimizing within a low-dimensional subspace and using a backfill-style update to retain residual components.

Shuchen Zhu, Zhengyang Huang, Yuqi Xu +1

Distributed Systems & Hardware Training Efficiency & Optimization

3w ago·also Cornell, Stony Brook

GraphPL: Leveraging GNN for Efficient and Robust Modalities Imputation in Patchwork Learning

Patchwork learning gets a boost: GraphPL uses GNNs to flexibly integrate all observed modalities, achieving SOTA imputation performance even with noisy inputs.

Xingjian Hu, Zuoyu Yan, Jianhua Zhu +3

Distributed Systems & Hardware Multimodal Models Training Efficiency & Optimization

3w ago

Enhancing SignSGD: Small-Batch Convergence Analysis and a Hybrid Switching Strategy

SignSGD can beat Adam and even SGD with a few simple tweaks, proving that 1-bit quantization doesn't have to mean sacrificing accuracy.

Haoran Chen, Wentao Wang

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Changyu Li +7·3w ago

FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices

Stop wasting bandwidth on irrelevant tokens: Fed-FSTQ uses Fisher information to selectively quantize and transmit only the most important tokens, slashing communication costs in federated LLM fine-tuning by up to 46x.

Changyu Li, Shuanghong Huang, Jiashen Liu +5

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

3w ago

Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study

Compound AI systems can achieve nearly 4x throughput improvement and cut tail latency in half with a modular, autoscaling inference architecture.

Srikanta Prasad S, Utkarsh Arora

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Yongtao Yao +5·3w ago

QAROO: AI-Driven Online Task Offloading for Energy-Efficient and Sustainable MEC Networks

Quantum-inspired attention networks can significantly improve task offloading performance in MEC networks, offering a practical path to more energy-efficient and sustainable edge computing.

Yongtao Yao, Yao Yang, Haorui Shi +3

Distributed Systems & Hardware Robotics & Embodied AI Tool Use & Agents

Hala ElAarag +1·3w ago

Hands-on PDC in Undergraduate Computing Education

Give undergrads supercomputer access, and they'll actually grok parallel computing.

Hala ElAarag, Anas Gamal Aly

Distributed Systems & Hardware Training Efficiency & Optimization

Verdict Security · 3w ago · also Ain Shams University

Prime-Field PINI: Machine-Checked Composition Theorems for Post-Quantum NTT Masking

Fresh masking between pipeline stages in NTT-based post-quantum crypto isn't just good practice; it's provably necessary to erase vulnerabilities arising from prior stages, as demonstrated with a machine-checked proof and a real-world hardware flaw.

Ray Iskander, Khaled Kirah

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Open-Source Models & Weights

Free University of Bozen-Bolzano · 3w ago · also Arizona, Nutrosal Inc.

Key Developer Roles and Organizational Coupling in Microservices: A Longitudinal Analysis

Organizational coupling in microservices isn't just about architecture – it's heavily influenced by the "Connector" roles bridging organizational silos, suggesting targeted interventions are possible.

Xiaozhou Li, Nariman Mani, Jose Sosa Rodriguez +1

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Distributed Systems & Hardware

3w ago · also DTU, Verified Systems International GmbH

Scenario-based System Testing for Distributed Robotics Applications

Automating system-level testing for distributed robotics is now more practical with a new language that handles complexity, non-determinism, and dynamic reconfiguration.

Jan Peleska, Felix Brüning, Wen-Ling Huang +1

Code Generation & Program Synthesis Distributed Systems & Hardware Robotics & Embodied AI

Zhouzhi Xiong +4 · 3w ago

DenseScout: Algorithm-System Co-design for Budgeted Tiny Object Selection on Edge Platforms

Prioritizing tiny objects on edge devices isn't just about detector accuracy; DenseScout shows that a lightweight, dense-response selector coupled with transport-aware runtime can drastically outperform traditional detectors under strict compute and latency budgets.

Zhouzhi Xiong, Zimou Zeng, Shu Xu +2

Computer Vision Distributed Systems & Hardware Inference & Quantization

Ce Zheng +6 · 3w ago · also Pengcheng Laboratory

SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission

Federated LLM inference gets a speed boost: SpecFed's speculative decoding and compressed communication slash latency without sacrificing generation quality.

Ce Zheng, Xinghan Wang, Jiahong Ning +4

Distributed Systems & Hardware Inference & Quantization

P. Bechtle +16 · 3w ago · also Bonn, Forschungszentrum Jülich, Heidelberg, KIT

Economical and ecological impact of sector coupling applied to computing clusters

Shifting computing workloads to periods of high renewable energy availability slashes both carbon emissions and operational costs for HPC clusters.

P. Bechtle, Oliver Freyermuth, O. Freyermuth +14

Distributed Systems & Hardware Training Efficiency & Optimization

H. Babak +1 · 3w ago

CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments

Unlock significant speedups in depthwise convolutions (up to 3.26x) with optimized CUDA kernels, even in restricted cloud environments lacking hardware performance counters.

H. Babak, Melanie Schaller

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

Zihao Xuan +6 · 3w ago

FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture

FusionCIM slashes LLM inference energy costs by nearly 4x while doubling processing speed, setting a new benchmark for efficiency in AI hardware.

Zihao Xuan, Jia Chen, Yewen Li +4

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Yigang Geng +3 · 3w ago

Multi-Periodogram Velocity Estimation with Irregular Reference Signals for Robot-Aided ISAC

Yigang Geng, Pan Cao, Ting Zeng +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Robotics & Embodied AI

Shouxu Lin +2 · 3w ago

DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference

Forget prefetching: DAK unlocks up to 3x faster LLM inference by enabling direct GPU access to remote memory, achieving near-optimal system bandwidth utilization.

Shouxu Lin, Zhiyuan Guo, Jiaxin Lin

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

DAMO · 3w ago · also BAIR, Tsinghua AI, Intel Labs, Rice

Pythia: Toward Predictability-Driven Agent-Native LLM Serving

Multi-agent LLM systems are leaving performance on the table by treating structured agent interactions as generic traffic; Pythia shows how to unlock substantial gains by exploiting workflow semantics at the serving layer.

Xin Jin, Xuanzhe Liu

Distributed Systems & Hardware Inference & Quantization Tool Use & Agents

Andrew E. M. Lewis-Pye +2 · 3w ago · also London School of Economics

Volitional Multiagent Atomic Transactions: Describing People and their Machines

Finally, a formal model that treats humans as more than just external noise in distributed systems, opening the door to verifiable grassroots platforms.

Andrew E. M. Lewis-Pye, Andy Lewis-Pye, Ehud Shapiro

Distributed Systems & Hardware Tool Use & Agents

Vyom Sharma +1 · 3w ago

RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

Stop leaving 10-70% of your MoE kernel throughput on the table: RaMP dynamically optimizes kernel configuration based on runtime expert routing, achieving up to 1.41x end-to-end speedup in vLLM serving.

Vyom Sharma, Debajyoti Datta

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Faculty of Informatics Institute of Informatics · 3w ago · also Research Group Parallel Computing, TU Wien

Two Efficient Message-passing Exclusive Scan Algorithms

Exclusive scan algorithms, often overlooked, get a speed boost with two new approaches that minimize communication overhead in parallel message-passing systems.

Jesper Larsson Träff

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

Ma Zirui +7 · 3w ago

AHASD: Asynchronous Heterogeneous Architecture for LLM Adaptive Drafting Speculative Decoding on Mobile Devices

Mobile LLM inference just got a whole lot faster: AHASD achieves up to 4.2x throughput and 5.6x energy efficiency gains by intelligently decoupling and managing drafting and verification tasks on a PIM-NPU architecture.

Ma Zirui, Zhihua Fan, Wenxin Li +5

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Smart Sensors Group at Hamburg · 3w ago · also Hamburg University of Technology

At the Edge of the Heart: ULP FPGA-Based CNN for On-Device Cardiac Feature Extraction in Smart Health Sensors for Astronauts

On-device cardiac monitoring is now feasible on ultra-low-power wearables, achieving 98% accuracy at just 8.55mW.

Kazi Mohammad Abidur Rahman, Davis Rakhshan, Philipp Lütke +3

Architecture Design (Transformers, SSMs, MoE) Computer Vision Distributed Systems & Hardware +1

Archisman Ghosh +2 · 3w ago

No Tile Left Behind: Multiprogramming for Surface-Code Architectures

FTQC multiprogramming is not just about qubit partitioning; it's a complex puzzle of structured floorplans, resource contention, and dynamic magic-state generation, and this work provides a framework to solve it.

Archisman Ghosh, Avimita Chatterjee, Swaroop Ghosh

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Robin Geens +3 · 3w ago

Hardware Generation and Exploration of Lookup Table-Based Accelerators for 1.58-bit LLM Inference

LUT-based hardware architectures can achieve up to 2.2x area reduction for LLM inference by challenging conventional design assumptions and optimizing for activation data types.

Robin Geens, Joran Heldens, Joren Dumoulin +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Mingbo Hao +7 · 3w ago · also SEU

NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference

Forget GPUs – NVLLM's 3D NAND-centric design slashes LLM inference latency by up to 37.9x on edge devices, making on-device LLMs a real possibility.

Mingbo Hao, Changwei Yan, Haoyu Cui +5

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Jangho Baik +4 · 3w ago

RecFlash: Fast Recommendation System on In-Storage Computing with Frequency-Based Data Mapping

RecFlash slashes recommendation inference latency by up to 81% and energy consumption by nearly 92% through smart data remapping in NAND flash memory.

Jangho Baik, Sunghyun Kim, Gisan Ji +2

Distributed Systems & Hardware Inference & Quantization Recommendation & Information Retrieval

3w ago · also NVIDIA, Columbia, Samsung Semiconductor, Yonsei

AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

Forget GPU-centric designs: AMMA slashes attention latency by 15x and energy consumption by 7x with a memory-centric architecture for long-context LLMs.

Zhongkai Yu, Haotian Ye, Haotian Ye +12

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Ming Chen +3 · 3w ago

Adaptive Management of Microservices in Dynamic Computing Environments: A Taxonomy and Future Directions

Current adaptive microservice management systems only scratch the surface of real-world production dynamics, and their purported gains may be overstated.

Ming Chen, Muhammed Tawfiqul Islam, M. R. Read +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Ke Dong +3 · 3w ago

TetrisG-SDK: Efficient Convolutional Layer Mapping with Adaptive Windows and Grouped Convolutions for Fast In-Memory Computing

TetrisG-SDK achieves up to 1.3x faster convolutional layer processing while slashing energy consumption by over 70% in some cases.

Ke Dong, Kejie Huang, Tao Luo +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Sean Nian +4 · 3w ago

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

CacheFlow slashes LLM serving latency by up to 62% by rethinking KV cache restoration as a 3D-parallel scheduling problem, not just a recompute vs. I/O tradeoff.

Sean Nian, Jiahao Fang, Qilong Feng +2

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Emre Ardıç +2 · 3w ago · also Gebze Technical University

Sample selection using multi-task autoencoders in federated learning with non-IID data

Federated learning accuracy jumps by up to 7% simply by using a multi-task autoencoder to identify and filter out noisy or uninformative samples on each client.

Emre Ardıç, Emre Ardiç, Yakup Genç

Computer Vision Distributed Systems & Hardware Training Efficiency & Optimization

Apr 27, 2026

Alex Bienstock +7 · 3w ago

Scalable Secure Biometric Authentication without Auxiliary Identifiers

Finally, a practical biometric authentication system offers provable security against large-scale data breaches without sacrificing scalability or requiring auxiliary identifiers.

Alex Bienstock, Daniel Escudero, Antigoni Polychroniadou +5

Distributed Systems & Hardware Inference & Quantization
