April 27 – May 4, 2026

Distributed Systems & Hardware - Weekly Roundup

100 papers published across 4 labs.

2700% acceleration

Selected Labs publishing this week

Tsinghua AI (3), NVIDIA (2), Microsoft Research (1), ETH (1)

Top Papers

Apr 30, 2026

Barcelona Supercomputing Center·3w ago·also Czestochowa University of Technology, Universitat Jaume I

A study on the performance of distributed training of data-driven CFD simulations

Distributed GPU training slashes the time needed to train deep learning models for CFD, making accurate fluid simulation predictions accessible in a fraction of the time.

Sergio Iserte, A. González-Barberá, Alejandro González-Barberá +25

Distributed Systems & Hardware Scientific Discovery & Drug Design Training Efficiency & Optimization

Barcelona Supercomputing Center·3w ago·also Czestochowa University of Technology, Universitat Jaume I

Towards the Democratization and Standardization of Dynamic Resources with MPI Spawning

Unlock HPC application malleability without the headache of process respawning thanks to this unified dynamic resource management API.

Sergio Iserte, Iker Martín-Álvarez, Iker Martín-Alvarez +75

Distributed Systems & Hardware

May 4, 2026

3w ago·also USC

(POSTER) From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications

Slash sensor application development time from weeks to days by leveraging AI-assisted pattern reuse for intent-driven workflow design.

Komal Thareja, Anirban Mandal, Ewa Deelman

Distributed Systems & Hardware Scientific Discovery & Drug Design Tool Use & Agents

Qipeng Wang +1·3w ago

Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference

LLM serving can get a 34% boost in end-to-end SLO attainment by intelligently scheduling prefill and decode requests based on urgency and slack.

Qipeng Wang, Zhendong Yang

Distributed Systems & Hardware Inference & Quantization

3w ago·also USC

From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications

Rapidly prototype sensor-driven applications across diverse infrastructures without needing cross-domain expertise using AI-assisted, pattern-based workflow engineering.

Komal Thareja, Anirban Mandal, Ewa Deelman

Distributed Systems & Hardware Robotics & Embodied AI Scientific Discovery & Drug Design

All Papers (100)

May 4, 2026

Jenny Lynn Almerol +3·3w ago·also Studi Avanzati (SISSA)

Assessing Performance and Porting Strategies for Gravitational N-Body Simulations on the RISC-V-Based Tenstorrent Wormhole™

RISC-V accelerators, originally for AI, can efficiently run scientific simulations, but only with the right parallelization strategy.

Jenny Lynn Almerol, Elisabetta Boella, Mario Spera +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Scientific Discovery & Drug Design

3w ago·also University of Jyväskylä

Distributed Quantum Circuit Optimisation: Evaluating Global and Local encodings

Quantum circuit optimization doesn't always improve distributed execution: sometimes, local optimization surprisingly beats global methods at minimizing communication costs.

Maria Gragera Garces, Majid Haghparast

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

3w ago·also HSE University, IDEAS: Inter-Disciplinary & Advanced, Mid Hope Technologies, Moscow Institute of Physics and Technology +2

Caliper-in-the-Loop: Black-Box Optimization for Hyperledger Fabric Performance Tuning

Bayesian optimization can automatically tune Hyperledger Fabric configurations to achieve double-digit throughput improvements, but the impact of measurement noise on interpreting gains cannot be ignored.

Yash Madhwal, Arseny Bolotnikov, Mark Prikhno +5

Distributed Systems & Hardware Training Efficiency & Optimization

Hongbin Zhang +5·3w ago

PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers

Commodity GPU servers can achieve surprisingly high LLM inference throughput by cleverly orchestrating pipeline parallelism with KV cache offloading.

Hongbin Zhang, Taosheng Wei, Jiazhi Jiang +3

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Ahmad Dabaja +1·3w ago

FedPLT: Scalable, Resource-Efficient, and Heterogeneity-Aware Federated Learning via Partial Layer Training

FedPLT achieves full-model accuracy in federated learning while training up to 82% fewer parameters per client, slashing communication costs and enabling participation from resource-constrained devices.

Ahmad Dabaja, Rachid El-Azouzi

Distributed Systems & Hardware Training Efficiency & Optimization

Mohammadreza Doostmohammadian +1·3w ago

Distributed Observer-based Fault Detection over Intelligent Networked Multi-Vehicle Systems

CAVs can now detect sensor anomalies in their measurements without relying on a central unit, even when tracking human-driven vehicles that aren't directly observable.

Mohammadreza Doostmohammadian, Hamid R. Rabiee

Distributed Systems & Hardware Robotics & Embodied AI

S. Catalán +2·3w ago

Leveraging Teaching on Demand: Approaching HPC to Undergrads

Hands-on experience with Raspberry Pi clusters and student-driven learning can effectively bridge the HPC skills gap in undergraduate engineering education.

S. Catalán, R. Carratalá-Sáez, S. Iserte

Code Generation & Program Synthesis Distributed Systems & Hardware

Georg-August-Universität Göttingen·3w ago·also Georg-August-Universität Göttingen /, GWDG mbH

A Treasure Trove of Performance: Analyzing the IO500 Submission Data

HPC storage benchmarks hide a wealth of insights into filesystem-specific overheads and load imbalances, if you're willing to dig into the logs.

Julian Kunkel, Aasish Kumar Sharma, Anila Ghazanfar +2

Distributed Systems & Hardware Eval Frameworks & Benchmarks

3w ago·also Princeton, Rutgers

AAFLOW: Scalable Patterns for Agentic AI Workflows

Agentic workflows can be sped up by 4.6x, not through faster LLMs, but by optimizing data flow and communication between components.

Arup Kumar Sarker, Mills Staylor, Aymen Alsaadi +3

Distributed Systems & Hardware Tool Use & Agents

Yijiang Li +5·3w ago

FedQueue: Queue-Aware Federated Learning for Cross-Facility HPC Training

FedQueue tackles the Achilles' heel of federated learning on HPC clusters - unpredictable queue delays - by explicitly modeling and mitigating their impact, leading to significant speedups.

Yijiang Li, Emon Dey, Zilinghan Li +3

Distributed Systems & Hardware Training Efficiency & Optimization

May 3, 2026

3w ago

On the Distortion of Partitioning Performance by Random Quantum Circuits

Random quantum circuits, a common proxy for real workloads, can mislead the design of distributed quantum computing compilers by distorting hypergraph partitioning performance.

Maria Gragera Garces

Distributed Systems & Hardware Eval Frameworks & Benchmarks

University of Sharjah·3w ago·also Bologna

Decentralized Stratified Sampling for Low-Latency Approximate Geospatial Data Stream Processing in Edge-Cloud Architectures

Offloading geospatial data sampling to the edge slashes latency and bandwidth costs, achieving cloud-competitive accuracy with 80% less data.

Isam Mashhour Al Jawarneh, Lorenzo Felletti, Luca Foschini +1

Data Curation & Synthetic Data Distributed Systems & Hardware

NVIDIA·3w ago·also TAU

nvPAX: Constrained Optimization for Dynamic Power Allocation in Hierarchical and Multi-Tenant Systems

Hierarchical power allocation in datacenters can achieve near-perfect satisfaction ratios, even with oversubscription, by using a novel three-phase QP/LP optimization policy.

Hadar Sivan, Gil Shabat, Yoel Shkolnisky

Distributed Systems & Hardware Training Efficiency & Optimization

3w ago·also Microsoft Research, Forschungszentrum Jülich GmbH, Snowflake

Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips

Optimizing for runtime in multimodal training can be energy-inefficient, as data movement and overlap on Grace Hopper chips dominate energy consumption, not raw compute.

Mahmoud Ahmed, Sameh Abdulah, Olatunji Ruwase +4

Distributed Systems & Hardware Multimodal Models Training Efficiency & Optimization

Yihan Xue +4·3w ago

Joint Temporal-Structural Representation Learning for Distributed Fault Discrimination in Microservice Architectures

Untangling the chaotic web of microservice failures just got easier: a new model uses temporal graph neural networks to pinpoint faults by jointly learning how services evolve and interact.

Yihan Xue, Yuxiao Wang, Ao Zhu +2

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

3w ago

SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving

Cut KV-cache transfer times by up to 32% with SplitZip, a new GPU-friendly lossless compressor that unlocks faster disaggregated LLM serving.

Yipin Guo, Siddharth Joshi

Distributed Systems & Hardware Inference & Quantization

May 1, 2026

Zi-Bo Qin +4·3w ago

Learning to Act and Cooperate for Distributed Black-Box Consensus Optimization

LLMs can now intelligently orchestrate multi-agent systems, learning to optimize both individual agent actions and inter-agent cooperation for distributed black-box problems.

Zi-Bo Qin, Zijian Qin, Feng-Feng Wei +2

Distributed Systems & Hardware Robotics & Embodied AI Tool Use & Agents

Apr 30, 2026

Shun Takagi +1·3w ago

Shuffling-Aware Optimization for Private Vector Mean Estimation

Shuffling data introduces a fundamental shift in the privacy-utility tradeoff for mean estimation, rendering locally differentially private (LDP) mechanisms suboptimal.

Shun Takagi, Seng Pei Liew

Distributed Systems & Hardware Training Efficiency & Optimization

3w ago·also BU, Cornell, NTT Physics and Informatics Laboratories

Physical Foundation Models: Fixed hardware implementations of large-scale neural networks

Forget chasing bigger GPUs – the future of AI inference could be literally baked into the hardware itself, unlocking 1000x gains in energy and speed.

Logan G. Wright, Tianyu Wang, Tatsuhiro Onodera +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

3w ago·also RUC

Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning

By intelligently perturbing class prototypes based on their discriminative power, VPDR achieves a superior privacy-utility trade-off in federated learning compared to naive Gaussian noise.

Yuhua Wang, Qinnan Zhang, Xiaodong Li +6

Data Curation & Synthetic Data Distributed Systems & Hardware Training Efficiency & Optimization

3w ago

FMCL: Class-Aware Client Clustering with Foundation Model Representations for Heterogeneous Federated Learning

Foundation model embeddings reveal hidden structure in federated datasets, enabling surprisingly effective client clustering without any training or communication overhead.

Mahad Ali, M. Ali, Laura J. Brattain

Data Curation & Synthetic Data Distributed Systems & Hardware Training Efficiency & Optimization

Sivaram Krishnan +6·3w ago

Toward Scalable SDN for LEO Mega-Constellations: A Graph Learning Approach

Managing thousands of LEO satellites just got easier: a novel graph learning approach slashes network management overhead while boosting forecasting accuracy.

Sivaram Krishnan, S. Krishnan, Bassel Al Homssi +4

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Xubin Luo +1·3w ago

AI Inference as Relocatable Electricity Demand: A Latency-Constrained Energy-Geography Framework

Unlocking the energy-latency frontier reveals how much cheaper and greener AI inference could be if we strategically relocate computation based on latency tolerance.

Xubin Luo, Yang Cheng

Distributed Systems & Hardware Inference & Quantization

Mohd Sameen Chishti +2·3w ago

Feature-Centric Methodology for Analyzing Cross-Chain NFT Migration Compatibility

Stop costly cross-chain NFT migrations before they start: a new feature-centric methodology predicts which NFT functionalities will break when moving between blockchains like Ethereum and Solana.

Mohd Sameen Chishti, Damilare Peter Oyinloye, Jingyue Li

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

3w ago·also Dolby Labs

ReVo: A Cross-Layer Reliable Volumetric Videoconferencing System

Volumetric videoconferencing doesn't have to freeze and stutter: ReVo recovers up to 32% of lost RGB data and slashes video freezes by 95% using a cross-layer approach.

Ankur Aditya, Diptyaroop Maji, Lingdong Wang +6

Computer Vision Distributed Systems & Hardware

3w ago·also RUC, Westlake

Akita: A High Usability Simulation Framework for Computer Architecture

Frustrated with clunky architecture simulators? Akita offers a breath of fresh air with its focus on developer experience, promising faster prototyping and experimentation.

Sabila Al Jannat, Sabila Al Jannat, Ying Li +12

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Muhammad Ihsan Al Hafiz +3·3w ago

NeuroRing: Scaling Spiking Neural Networks via Multi-FPGA Bidirectional Ring Topologies and Stream-Dataflow Architectures

NeuroRing achieves faster-than-real-time execution of a full-scale cortical microcircuit simulation on FPGAs, proving that scalable, energy-efficient SNN hardware is within reach.

Muhammad Ihsan Al Hafiz, Muhammad Ihsan Al Hafiz, Artur Podobas +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

3w ago·also ANL

Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3

Cerebras CS-3 can deliver 100x speedups over CPU for sparse matrix multiplication at 90% sparsity, but surprisingly, becomes *slower* than CPU beyond 99% sparsity.

Milan Shah, Sheng Di, Michela Becchi

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

Tijn de Vos +4·3w ago

Distributed Santa Claus via Global Rounding

Even approximately fair gift-giving is surprisingly hard in distributed systems: achieving any approximation for the Santa Claus problem requires $\Omega(\sqrt{n} + D)$ rounds.

Tijn de Vos, Leo Wennmann, Malte Baumecker +2

Distributed Systems & Hardware

MEV-X·3w ago·also HSE University, Moscow Institute of Physics and Technology, Skoltech

The Origins of MEV: Systematic Attribution of Arbitrage Opportunity Creation at Scale

Most MEV arbitrage opportunities on Polygon can be traced back to a single transaction, revealing surprising concentration in MEV creation across protocols.

Andrei Seoev, Dmitry Belousov, Anastasiia Smirnova +6

Distributed Systems & Hardware

Jin Xin Ng +10·3w ago

Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale

Schedulers can boost throughput by 12% on chiplet-based systems simply by treating spatial locality as a first-class objective, even if it means sacrificing work-conservation.

Jin Xin Ng, Ori Livneh, R. O'Grady +8

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Behnaz Ranjbar +1·3w ago

AnTi-MiCS: Analytical Framework for Bounding Time in Embedded Mixed-Criticality Systems

Balancing processor utilization and Quality-of-Service in mixed-criticality systems just got easier with AnTi-MiCS and MulTi-MiCS, which automatically determine optimal low WCETs and improve QoS by up to 30%.

Behnaz Ranjbar, Akash Kumar

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

3w ago·also Beijing Academy of Blockchain and Edge, Fudan, Shanghai Academy of Future Internet

Back to the Future: Rethinking Endorsement in Order-Execute Blockchains

Order-execute blockchains can achieve 10x higher throughput in DeFi workloads by embedding flexible endorsement directly into the consensus mechanism, avoiding the high abort rates of execute-order-validate approaches.

Rongji Huang, Yifeng Ye, Gerui Wang +5

Distributed Systems & Hardware

3w ago·also Samsung Electronics

AME-PIM: Can Memory be Your Next Tensor Accelerator?

HBM-PIM can achieve impressive matrix multiplication throughput (14.9 GFLOP/s) using a novel reduction-free outer-product dataflow, even without native reduction support.

Emanuele Venieri, Simone Manoni, Alberto Florian +3

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Jisheng Zhao +4·3w ago

CuLifter: Lifting GPU Binaries to Typed IR

Recovering type information from untyped GPU register files is the key to enabling effective binary analysis, unlocking reverse engineering and security analysis of proprietary GPU code.

Jisheng Zhao, Huanzhi Pu, Shinnung Jeong +2

Code Generation & Program Synthesis Distributed Systems & Hardware Inference & Quantization

Yan-Cheng Guo +2·3w ago

RCW-CIM: A Digital CIM-based LLM Accelerator with Read-Compute/Write

Forget waiting – this new CIM architecture slashes LLM weight update latency by up to 87%, unlocking faster prefill and decoding.

Yan-Cheng Guo, Tian-Sheuan Chang, Jian-Wei Su

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Zi-Wei Lin +2·3w ago

VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling

Ternary LLMs can achieve impressive throughput and energy efficiency on edge devices, thanks to VitaLLM's co-designed hardware acceleration that overcomes workload imbalance and data dependency challenges.

Zi-Wei Lin, Zimiao Lin, Tian-Sheuan Chang

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

3w ago

WOOTdroid: Whole-system Online On-device Tracing for Android

Android's security-relevant IPC is now traceable on stock devices without app instrumentation, closing a critical visibility gap for security researchers and incident responders.

Simon Althaus, S. Althaus, Nikolaos Alexopoulos +7

Distributed Systems & Hardware

3w ago·also INRIA

Strait: Perceiving Priority and Interference in ML Inference Serving

Juggling high-priority and low-priority ML inference requests on GPUs? Strait delivers up to 11% fewer missed deadlines for critical tasks.

Haidong Zhao, Nikolaos Georgantas

Distributed Systems & Hardware Inference & Quantization

3w ago·also Tsinghua AI, PolyU

FedHarmony: Harmonizing Heterogeneous Label Correlations in Federated Multi-Label Learning

Federated learning can overcome data silos, but struggles when clients have different label relationships; FedHarmony shows how to harmonize these differences, leading to better performance.

Zhiqiang Kou, Zhi Kou, Jun Wu +11

Data Curation & Synthetic Data Distributed Systems & Hardware Natural Language Processing

Zhenzhou Jin +3·3w ago

Statistical Channel Fingerprint Construction for Massive MIMO: A Unified Tensor Learning Framework

Ditch the encoder-decoder: LPWTNet's closed-form Laplacian pyramid decomposition offers efficient inference for statistical channel fingerprint construction in massive MIMO systems.

Zhenzhou Jin, Li You, Xiang-Gen Xia +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

Zehui Tang +3·3w ago·also MIIT Key Laboratory of Pattern Analysis, NJU

AdaBFL: Multi-Layer Defensive Adaptive Aggregation for Byzantine-Robust Federated Learning

Adaptively weighting defenses in federated learning lets you robustly handle diverse attacks without needing the dataset on the server.

Zehui Tang, Yuchen Liu, F. Huang +1

Distributed Systems & Hardware Red-Teaming & Adversarial Robustness

3w ago

Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes

Achieve 100% agent recovery correctness with near-zero overhead by intelligently checkpointing only the OS state that actually matters.

Tianyuan Wu, Chaokun Chang, Chaokun Chang +4

Distributed Systems & Hardware Tool Use & Agents

Wenxiang Lin +5·3w ago·also HIT

ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training

LLM training bottlenecks? ZipCCL achieves up to 1.18x end-to-end speedups by losslessly compressing communication collectives, without sacrificing model quality.

Wenxiang Lin, Xinglin Pan, Ruibo Fan +3

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

3w ago

ScaleBox: Enabling High-Fidelity and Scalable Code Verification for Large Language Models

LLMs trained with ScaleBox, a new high-fidelity code verification system, substantially outperform those trained with heuristic matching, suggesting current RLHF methods are bottlenecked by verification quality.

Jiasheng Zheng, Xin Zheng, Boxi Cao +9

Code Generation & Program Synthesis Distributed Systems & Hardware Eval Frameworks & Benchmarks

Apr 29, 2026

Akshay Karjol +1·3w ago

Real-Time GPU-Accelerated Monte Carlo Evaluation of Safety-Critical AEB Systems Under Uncertainty

Real-time, GPU-accelerated Monte Carlo simulation makes probabilistic safety guarantees for Automatic Emergency Braking systems deployable, not just a validation afterthought.

Akshay Karjol, Shadi Alawneh

Computer Vision Distributed Systems & Hardware Robotics & Embodied AI

Ahan Gupta +4·3w ago·also Snowflake

AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism

Training LLMs on ultra-long contexts just got a whole lot easier: AutoSP automates sequence parallelism and activation checkpointing, boosting context length by up to 2.7x with negligible throughput cost.

Ahan Gupta, Zhihao Wang, Neel Dani +2

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

3w ago

Adaptive Self-Organization in Anonymous Dynamic Networks

Even with adversarial network changes and only local signals, surprisingly simple distributed algorithms can enable dynamic networks to self-organize and adapt to changing environmental goals.

Garrett Parzych, Joshua J. Daymude

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

3w ago

FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving

Slash MoE serving costs by two-thirds with FaaSMoE, a serverless architecture that dynamically scales experts on demand.

Minghe Wang, Trever Schirmer, Mohammadreza Malekabbasi +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Barcelona Supercomputing Center·3w ago·also UPC

A Semantic Quantum Circuit Cache for Scalable and Distributed Quantum-Classical Workflows

Stop recomputing the same quantum circuits: a semantic cache slashes redundant simulations by up to 92% and speeds up real quantum hardware by 11x.

Mar Tejedor, Javier Conejero, Rosa M. Badia

Distributed Systems & Hardware Inference & Quantization

University of Artificial Intelligence·3w ago

COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training

Fixing your parallelism strategy while tuning batch size (or vice versa) leaves performance on the table: COPUS adaptively co-tunes both for faster LLM training.

Akhmed Sakip, Erland Hilman Fuadi, Omar Sayedelahl +6

Distributed Systems & Hardware Training Efficiency & Optimization

Tianhao Hu +16·3w ago·also HKUST

DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

Asynchronous RL for LLMs doesn't have to sacrifice convergence for speed: DORA achieves 2-4x faster training by cleverly managing multiple policy versions during rollout.

Tianhao Hu, Xiangcheng Liu, Youshao Xiao +14

Distributed Systems & Hardware RLHF & Preference Learning Training Efficiency & Optimization

Timothy Flavin +1·3w ago

A High-Throughput Compute-Efficient POMDP Hide-And-Seek-Engine (HASE) for Multi-Agent Operations

Training complex multi-agent RL policies just got 3,500x faster thanks to a new engine that optimizes for memory access and data locality.

Timothy Flavin, Sandip Sen

Distributed Systems & Hardware Robotics & Embodied AI Training Efficiency & Optimization

Hyunsung Yoon +3·3w ago

Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators

Dense matrix multiplication accelerators can surprisingly outperform dedicated sparse accelerators for sparse neural networks, offering better area and energy efficiency.

Hyunsung Yoon, Sungju Ryu, Sungju Ryu +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Barcelona Supercomputing Center (BSC)·3w ago

Verification and Validation (V&V)-in-the-Loop for RISC-V Design: The Holistic Vision of BZL

A holistic, industrial-grade V&V loop promises to accelerate and de-risk RISC-V chip design by integrating RTL validation, FPGA-based system-level testing, and continuous integration.

Sajjad Ahmed, Alexander Kropotov, Roberto Ignacio Genovese +21

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Alexander Kropotov +2·3w ago

EMiX: Emulating Beyond Single-FPGA Limits

Emulating massive multi-core systems just got easier: EMiX lets you scale RISC-V emulation across multiple FPGAs without rewriting your RTL.

Alexander Kropotov, Miquel Moreto, Behzad Salami

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

3w ago

LLM-Guided Runtime Parameter Optimization for Energy-Efficient Model Inference

Forget grid search: LLMs can rapidly find energy-efficient inference parameters, outperforming traditional optimization methods with just a few human-guided prompts.

Katelyn Crumpacker, Dimitrios Nikolopoulos

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Ericsson AB·3w ago·also KTH

Where did we fail? -- Reproducing build failures in embedded open source software

Replaying CI failures in embedded systems is now possible at scale: PhantomRun reconstructs over 90% of failing builds, opening the door to systematic debugging and failure analysis.

Han Fu, Andreas Ermedahl, Sigrid Eldh +3

Code Generation & Program Synthesis Distributed Systems & Hardware Open-Source Models & Weights

Verint Systems Inc·3w ago

When Your LLM Reaches End-of-Life: A Framework for Confident Model Migration in Production Systems

Stop sweating LLM migrations: this Bayesian framework lets you confidently swap models in production, even with limited human evals.

Emma Casey, David Roberts, David Sim +1

Distributed Systems & Hardware Eval Frameworks & Benchmarks Inference & Quantization

Nankai University·3w ago·also Tsinghua AI, PKU

Which Types of Heterogeneity Matter for Root Cause Localization in Microservice Systems?

Ignoring the nuanced interplay between services and hosts in microservice architectures leaves nearly 50% of root causes undiscovered.

Runzhou Wang, Shenglin Zhang, Wenwei Gu +5

Distributed Systems & Hardware

3w ago

End-to-End and Phase-Level Performance Optimization for Hyperledger Fabric

Overlapping validation and private-data acquisition of successive blocks with state-consistency checks and ledger updates can almost double Hyperledger Fabric's commit throughput.

Pavan Sollu, Aniruddha Mukherjee, Divya Pulivarthi +6

Distributed Systems & Hardware Training Efficiency & Optimization

Tsinghua AI·3w ago

Efficient Training on Multiple Consumer GPUs with RoundPipe

Fine-tune massive LLMs like Qwen3-235B with 31K context on a single 8x RTX 4090 server, thanks to a novel pipeline schedule that eliminates the weight binding bottleneck.

Yibin Luo, Yi Luo, Shiwei Gao +3

Distributed Systems & Hardware Training Efficiency & Optimization

Barcelona Supercomputing Center (BSC)·3w ago

A Test Taxonomy and Continuous Integration Ecosystem for Dynamic Resource Management in HPC

Automated testing of dynamic resource management frameworks in HPC is now possible, catching faults earlier and simplifying maintenance.

Petter Sandås, Íñigo Aréjula-Aísa

Code Generation & Program Synthesis Distributed Systems & Hardware Eval Frameworks & Benchmarks

NVIDIA·3w ago

Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding

Speculative decoding, typically used post-RL, can be integrated directly into RL training loops to accelerate LLM rollout generation by up to 2.5x.

Hayate Iso, Tiyasa Mitra, Sudipta Mondal +22

Distributed Systems & Hardware Inference & Quantization RLHF & Preference Learning +1

Yiqi Liu +4·3w ago

Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel

Forget brute-force scaling: smarter tile and tensor mapping on 3D-stacked chips could unlock massive LLM inference gains.

Yiqi Liu, Noelle Crawford, Michael Wang +2

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Rafael Mayo +1·3w ago

DMRlib: Easy-coding and Efficient Resource Management for Job Malleability

Unlock 3x higher throughput in your data center by easily converting MPI applications to malleable jobs with a new library.

Rafael Mayo, Enrique S. Quintana-Ortí

Code Generation & Program Synthesis Distributed Systems & Hardware Training Efficiency & Optimization

Bodon Jeong +8·3w ago

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

Edge LLM inference gets a serious speed boost: DUAL-BLADE's dual-path KV cache slashes latency by up to 42% and doubles SSD utilization.

Bodon Jeong, Bodon Jeong, H.I. Byun +6

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

ETH·3w ago·also ANU, Sydney

FloatSOM: GPU-Accelerated, Distributed, Topology-Flexible Self-Organizing Maps

Training a 1024-node SOM on a billion-sample dataset in just over 6 minutes shatters previous scalability limits, thanks to a novel framework that leverages multi-GPU execution, out-of-memory streaming, and flexible topologies.

Tony Xu, Sarah Klamt, Katherine Turner +3

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

3w ago · also Toulouse INP

MPI Malleability Validation under Replayed Real-World HPC Conditions

MPI malleability can cut HPC workload times by over 25% in real-world conditions, but only if you account for parallel efficiency.

S. Iserte, M. Madon, G. Da +2

Distributed Systems & Hardware

Anna Golubeva +1 · 3w ago

Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training & Inference

Squeeze more out of your hardware: TSP lets you shard both weights and activations across the same devices, unlocking memory savings for long-context training and inference.

Anna Golubeva, Quentin Anthony

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

Yimeng Shan +8 · 3w ago · also ANL, NJUST

SplitFT: An Adaptive Federated Split Learning System For LLMs Fine-Tuning

Fine-tuning LLMs in federated settings just got easier: SplitFT lets clients adapt their cut layers and LoRA ranks, boosting performance and slashing communication costs.

Yimeng Shan, Zhaorui Zhang +6

Distributed Systems & Hardware Natural Language Processing Training Efficiency & Optimization

Aditya Ukarande +7 · 3w ago

Efficient, VRAM-Constrained xLM Inference on Clients

Squeezing high-accuracy LLMs and VLMs onto client devices is now significantly more feasible, thanks to a new pipelined sharding technique that achieves up to 30x speedups and 10x VRAM reduction.

Aditya Ukarande, Deep Shekhar +5

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization +1

3w ago

Revealing NVIDIA Closed-Source Driver Command Streams for CPU-GPU Runtime Behavior Insight

NVIDIA's closed-source driver secrets are out: researchers can now see the exact hardware commands triggered by CUDA code.

Yuang Yan, Ian Karlin, Ryan Grant

Distributed Systems & Hardware Open-Source Models & Weights

3w ago · also Georgia Tech

Recent Advances in mm-Wave and Sub-THz/THz Oscillators for FutureG Technologies

Tomorrow's 6G networks hinge on overcoming the design hurdles of mm-wave and sub-THz oscillators, and this review lays out the roadmap.

Baktash Behmanesh, Ahmad Rezvanitabar

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

3w ago · also Vrije Universiteit Amsterdam

What Is the Cost of Energy Monitoring? An Empirical Study on the Overhead of RAPL-Based Tools

Naive RAPL-based energy monitoring can add nearly 50% overhead to your measurements, but optimized tools can keep it negligible.

Jeremy Diamond, Vincenzo Stoico

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Central Research Laboratory · 3w ago

Quantum Gatekeeper: Multi-Factor Context-Bound Image Steganography with VQC Based Key Derivation on Quantum Hardware

Quantum Gatekeeper achieves near-perfect information hiding: without all four factors (password, shared secret, context string, and reference image signature), payload extraction fails silently, preventing even partial disclosure.

Sahil Tomar

Computer Vision Distributed Systems & Hardware

Pericle Perazzo +1 · 3w ago

Catching the Fly: Practical Challenges in Making Blockchain FlyClient Real

FlyClient, a lightweight blockchain verification protocol, gets closer to real-world deployment with a practical Zcash implementation and proof-size optimizations.

Pericle Perazzo, Dario Capecchi

Distributed Systems & Hardware

University of the Cumberlands · 3w ago

Agent Name Service (ANS): A Proof-of-Concept Trust Layer for Secure AI Agent Discovery, Identity, and Governance in Kubernetes

Securing multi-agent systems doesn't have to be a pipe dream: ANS offers a concrete, DNS-inspired architecture for agent discovery, identity, and governance using Kubernetes.

Akshay Mittal, Elyson De La Cruz

Constitutional AI & AI Ethics Distributed Systems & Hardware Tool Use & Agents

3w ago

Towards Intelligent Computation Offloading in Dynamic Vehicular Networks: A Scalable Multilayer Pipeline

A modified Particle Swarm Optimization algorithm slashes computation offloading latency in vehicular networks, outperforming brute-force methods in dynamic, real-world scenarios.

Falk Dettinger, Matthias Weiß, Baran Can Gül +3

Computer Vision Distributed Systems & Hardware Robotics & Embodied AI

3w ago · also Illinois Institute of Technology

StreamGuard: Exploring a 5G Architecture for Efficient, Quality of Experience-Aware Video Conferencing

Squeezing more out of 5G video calls is possible: StreamGuard boosts video conferencing quality by up to 70% by intelligently prioritizing different parts of the video stream.

Xuyang Cao, Oliver Michel, Kyle Jamieson

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Apr 28, 2026

Shuchen Zhu +3 · Apr 28, 2026

Subspace Optimization for Efficient Federated Learning under Heterogeneous Data

Federated learning can achieve better accuracy-efficiency trade-offs under heterogeneous data by optimizing within a low-dimensional subspace and using a backfill-style update to retain residual components.

Shuchen Zhu, Zhengyang Huang, Yuqi Xu +1

Distributed Systems & Hardware Training Efficiency & Optimization

Apr 28, 2026 · also Cornell, Stony Brook

GraphPL: Leveraging GNN for Efficient and Robust Modalities Imputation in Patchwork Learning

Patchwork learning gets a boost: GraphPL uses GNNs to flexibly integrate all observed modalities, achieving SOTA imputation performance even with noisy inputs.

Xingjian Hu, Zuoyu Yan, Jianhua Zhu +3

Distributed Systems & Hardware Multimodal Models Training Efficiency & Optimization

Apr 28, 2026

Enhancing SignSGD: Small-Batch Convergence Analysis and a Hybrid Switching Strategy

SignSGD can beat Adam and even SGD with a few simple tweaks, proving that 1-bit quantization doesn't have to mean sacrificing accuracy.

Haoran Chen, Wentao Wang

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Changyu Li +7 · Apr 28, 2026 · also HKUST

FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices

Stop wasting bandwidth on irrelevant tokens: Fed-FSTQ uses Fisher information to selectively quantize and transmit only the most important tokens, slashing communication costs in federated LLM fine-tuning by up to 46x.

Changyu Li, Shuanghong Huang, Jiashen Liu +5

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Apr 28, 2026

Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study

Compound AI systems can achieve nearly 4x throughput improvement and cut tail latency in half with a modular, autoscaling inference architecture.

Srikanta Prasad S, Utkarsh Arora

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Yongtao Yao +5 · Apr 28, 2026

QAROO: AI-Driven Online Task Offloading for Energy-Efficient and Sustainable MEC Networks

Quantum-inspired attention networks can significantly improve task offloading performance in MEC networks, offering a practical path to more energy-efficient and sustainable edge computing.

Yongtao Yao, Yao Yang, Haorui Shi +3

Distributed Systems & Hardware Robotics & Embodied AI Tool Use & Agents

Hala ElAarag +1 · Apr 28, 2026

Hands-on PDC in Undergraduate Computing Education

Give undergrads supercomputer access, and they'll actually grok parallel computing.

Hala ElAarag, Anas Gamal Aly

Distributed Systems & Hardware Training Efficiency & Optimization

Verdict Security · Apr 28, 2026 · also Ain Shams University

Prime-Field PINI: Machine-Checked Composition Theorems for Post-Quantum NTT Masking

Fresh masking between pipeline stages in NTT-based post-quantum crypto isn't just good practice, it's provably necessary to erase vulnerabilities arising from prior stages, as demonstrated with a machine-checked proof and a real-world hardware flaw.

Ray Iskander, Khaled Kirah

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Open-Source Models & Weights

Free University of Bozen-Bolzano · Apr 28, 2026 · also Arizona, Nutrosal Inc.

Key Developer Roles and Organizational Coupling in Microservices: A Longitudinal Analysis

Organizational coupling in microservices isn't just about architecture – it's heavily influenced by the "Connector" roles bridging organizational silos, suggesting targeted interventions are possible.

Xiaozhou Li, Nariman Mani, Jose Sosa Rodriguez +1

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Distributed Systems & Hardware

Apr 28, 2026 · also DTU, Verified Systems International GmbH

Scenario-based System Testing for Distributed Robotics Applications

Automating system-level testing for distributed robotics is now more practical with a new language that handles complexity, non-determinism, and dynamic reconfiguration.

Jan Peleska, Felix Brüning, Wen-Ling Huang +1

Code Generation & Program Synthesis Distributed Systems & Hardware Robotics & Embodied AI

Zhouzhi Xiong +4 · Apr 28, 2026

DenseScout: Algorithm-System Co-design for Budgeted Tiny Object Selection on Edge Platforms

Prioritizing tiny objects on edge devices isn't just about detector accuracy; DenseScout shows that a lightweight, dense-response selector coupled with transport-aware runtime can drastically outperform traditional detectors under strict compute and latency budgets.

Zhouzhi Xiong, Zimou Zeng, Shu Xu +2

Computer Vision Distributed Systems & Hardware Inference & Quantization

Ce Zheng +6 · Apr 28, 2026 · also Pengcheng Laboratory

SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission

Federated LLM inference gets a speed boost: SpecFed's speculative decoding and compressed transmission slash latency without sacrificing generation quality.

Ce Zheng, Xinghan Wang, Jiahong Ning +4

Distributed Systems & Hardware Inference & Quantization

P. Bechtle +16 · Apr 28, 2026 · also Bonn, Forschungszentrum Jülich, Heidelberg, KIT

Economical and ecological impact of sector coupling applied to computing clusters

Shifting computing workloads to periods of high renewable energy availability slashes both carbon emissions and operational costs for HPC clusters.

P. Bechtle, Oliver Freyermuth, O. Freyermuth +14

Distributed Systems & Hardware Training Efficiency & Optimization

H. Babak +1 · Apr 28, 2026

CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments

Unlock significant speedups in depthwise convolutions (up to 3.26x) with optimized CUDA kernels, even in restricted cloud environments lacking hardware performance counters.

H. Babak, Melanie Schaller

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

Zihao Xuan +6 · Apr 28, 2026

FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture

FusionCIM slashes LLM inference energy costs by nearly 4x while doubling processing speed, setting a new benchmark for efficiency in AI hardware.

Zihao Xuan, Jia Chen, Yewen Li +4

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization
