May 1 – May 8, 2026

Distributed Systems & Hardware - Weekly Roundup

70 papers published across 6 labs.

Selected Labs publishing this week

Tsinghua AI (2), NVIDIA (2), ETH (1), CMU ML (1), OpenAI (1)

Top Papers

May 5, 2026

Chris S. Lin +6·2w ago

GPUBreach: Privilege Escalation Attacks on GPUs using Rowhammer

Rowhammer attacks aren't just for CPUs anymore: a malicious CUDA kernel can now leverage targeted bit flips to achieve root access on a system, even bypassing IOMMU protections.

Chris S. Lin, Yuqin Yan, Guozhen Ding +4

Distributed Systems & Hardware Red-Teaming & Adversarial Robustness

May 6, 2026

Chengyi Nie +2·2w ago

A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

Forget heuristics: this queueing theory framework precisely predicts LLM inference stability under KV cache constraints, letting you right-size your GPU cluster.

Chengyi Nie, Nian Si, Zijie Zhou

Distributed Systems & Hardware Inference & Quantization

2w ago

Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism

Training MoE models just got a whole lot faster: Piper achieves up to 3.5x higher MFU by intelligently scheduling pipeline parallelism and optimizing communication.

Sajal Dash, Feiyi Wang

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

2w ago

Learned Neighbor Trust for Collaborative Deployment in Model-Agnostic Decentralized Learning

Stop training in isolation: LNTrust lets decentralized models learn *who* to trust during training, so they can collaborate effectively at deployment, boosting accuracy and cutting communication costs.

Michael Lanier, Luise Ge, Sastry Kompella +1

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

ETH·2w ago·also ELLIS, Max Planck

Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts

Granular Mixture-of-Experts can now be efficient: AIR-MoE's two-stage routing slashes routing costs without sacrificing performance.

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

All Papers (70)

May 6, 2026

2w ago

Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism

Training MoE models just got a whole lot faster: Piper achieves up to 3.5x higher MFU by intelligently scheduling pipeline parallelism and optimizing communication.

Sajal Dash, Feiyi Wang

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

2w ago

Learned Neighbor Trust for Collaborative Deployment in Model-Agnostic Decentralized Learning

Michael Lanier, Luise Ge, Sastry Kompella +1

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

ETH·2w ago·also ELLIS, Max Planck

Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts

Granular Mixture-of-Experts can now be efficient: AIR-MoE's two-stage routing slashes routing costs without sacrificing performance.

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Han Wang +5·2w ago·also Tsinghua AI

KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

LLMs can generate GPU kernels, but they're surprisingly bad at it: 72% of fusion tasks fail across all methods, and nearly half of the "correct" kernels are actually slower than PyTorch.

Han Wang, Jintao Zhang, Kai Jiang +3

Code Generation & Program Synthesis Distributed Systems & Hardware Eval Frameworks & Benchmarks

Tsinghua AI·2w ago

Trustworthy Federated Label Distribution Learning under Annotation Quality Disparity

Federated learning struggles when data quality varies across clients, but FedQual solves this with a novel approach that calibrates low-quality clients while preserving high-quality autonomy.

Junxiang Wu, Zhi Kou, Hongwei Zeng +8

Data Curation & Synthetic Data Distributed Systems & Hardware Training Efficiency & Optimization

Leon Witt +4·2w ago

Knowledge-Free Correlated Agreement for Incentivizing Federated Learning

Incentivizing honest participation in federated learning is now possible without ground truth labels, even when some participants are trying to game the system.

Leon Witt, T. Abbaslı, Kentaroh Toyoda +2

Distributed Systems & Hardware Natural Language Processing Training Efficiency & Optimization

Kang Liu +2·2w ago

Budget-aware Auto Optimizer Configurator

Fine-tune optimizer precision block-by-block and slash memory use without sacrificing model quality.

Kang Liu, Wei Peng, Jianchen Hu

Distributed Systems & Hardware Training Efficiency & Optimization

Ilan University·2w ago

Modular Reinforcement Learning For Cooperative Swarms

Decomposing robot swarm state representations unlocks effective cooperation even with computationally limited agents.

Erel Shtossel, Gal A. Kaminka

Distributed Systems & Hardware RLHF & Preference Learning Robotics & Embodied AI

Chengyi Nie +2·2w ago

A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

Forget heuristics: this queueing theory framework precisely predicts LLM inference stability under KV cache constraints, letting you right-size your GPU cluster.

Chengyi Nie, Nian Si, Zijie Zhou

Distributed Systems & Hardware Inference & Quantization

2w ago·also Sydney

Probabilistic Atomic Swaps for Bitcoin and Friends

Atomic swaps can now handle probabilistic exchanges like lotteries and randomized allocations, opening up new possibilities for trustless cross-chain interactions.

Paul Gerhart, Jay Taylor, Sri Aravinda Krishnan Thyagarajan

Distributed Systems & Hardware

2w ago

A Pragmatic Comparison of Cryptographic Computation Technologies for Machine Learning

Choosing between secure multi-party computation (SMPC) and fully homomorphic encryption (FHE) for secure ML depends heavily on the model architecture: FHE excels at regressions and simple networks, while SMPC dominates for complex CNNs.

Marcus Taubert, Adam Skuta, Thomas Loruenser

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

J. Jung +3·2w ago

Fundamental Limitations of Post-Quantum Cryptographic Architectures

Lattice-based cryptography's reliance on injected noise for security is more akin to hiding secrets under a rug than truly erasing them, leaving them vulnerable to future quantum attacks.

J. Jung, Donghwa Ji, Mingyu Lee +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Tsinghua AI·2w ago·also Cryptape and Nervos, KU Leuven

Order Flow Exclusivity and Value Extraction Mechanisms: An Analysis of Ethereum Builder Centralization

Ethereum builder centralization isn't just about who has the best order flow, but also about how network effects let incumbents decouple from needing exclusive deals.

Ao Zhang, Yunwen Liu, Ren Zhang +9

Distributed Systems & Hardware

Lingzhe Zhang +8·2w ago

Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning

RFT's Achilles' heel? This benchmark reveals how fragile reinforcement fine-tuning is, and introduces an automated system to catch and fix training failures before they tank your LLM.

Lingzhe Zhang, Tong Jia, Yunpeng Zhai +6

Distributed Systems & Hardware RLHF & Preference Learning Training Efficiency & Optimization

Barkhausen Institut·2w ago

Interaction Tree Semantics for RISC-V: Bridging Compiler and Hardware Verification

Proving semantic equivalence between LLVM IR and RISC-V code is now possible within a single framework, thanks to a new formal RISC-V semantics built on Interaction Trees.

Shuanglong Kan, Sebastian Ertel

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Distributed Systems & Hardware

Jacob Wahlgren +4·2w ago

Communication Offloading on SmartNIC DPUs: A Quantitative Approach

Offloading communication to SmartNIC DPUs can speed up host-dominated workloads by 1.55x, but the lack of Direct Cache Access creates a massive DRAM bottleneck.

Jacob Wahlgren, Andong Hu, Roger Pearce +2

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Mingyu Guo +4·2w ago

Delay-Aware Large-Small Model Collaboration over LEO Satellite Networks

MARL-optimized collaboration between large and small models in LEO satellites slashes service delays by nearly a third.

Mingyu Guo, Wen Wu, Ying Wang +2

Distributed Systems & Hardware Inference & Quantization

Wenjun Yu +2·2w ago

One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

Generative recommenders can slash latency by up to 38% simply by dynamically juggling GPU memory between embedding and KV caches, a feat current systems miss.

Wenjun Yu, Shuguang Han, Amelie Chi Zhou

Distributed Systems & Hardware Inference & Quantization Recommendation & Information Retrieval

2w ago·also CMU ML

AGIPC: Adaptive In-Solve Algebraic Coarsening for GPU IPC

Implicit time integration on GPUs gets a 3x speed boost thanks to a novel algebraic coarsening method that avoids costly explicit remeshing.

Xuan Wang, Zhaofeng Luo, Minchen Li +2

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

Colorado State University·2w ago

MCFlash: Bulk Bitwise Processing in 3D NAND with Dynamic Sensing and Multi-level Encoding

Run billions of bitwise operations directly in your 3D NAND flash, error-free, using just standard instructions.

Habib Ur Rahman, Tharini Suresh, Sudeep Pasricha +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

2w ago

Not All Faults Are Equal: Transient-Fault Sensitivity Characterization of an Open-Source RISC-V Vector Cluster

Exponent bits are the Achilles' heel of floating-point arithmetic, as corrupting them in RISC-V vector processors leads to the most severe silent data corruption.

M. Cai, Amirhossein Kiamarzi, Davide Rossi +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Open-Source Models & Weights

M. Zaeemi +1·2w ago

Ultra Low-Power SDM-based Circuit-Switching for Networks-on-Chip

Radically reduce power consumption in AI chips with a circuit-switched network-on-chip that carves out dedicated "lanes" for predictable communication flows.

M. Zaeemi, Mehdi Modarressi

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Hanum Ko +3·2w ago

RangeGuard: Efficient, Bounded Approximate Error Correction for Reliable DNNs

RangeGuard lets you tolerate 64+ flipped bits in DNN memory using just 16 bits of parity, without sacrificing accuracy.

Hanum Ko, Sang Yeon, Jong Hwan Ko +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

May 5, 2026

Yixuan Mei +10·2w ago

Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

Save up to 2.79x on LLM serving costs by intelligently distributing models across a diverse fleet of cloud GPUs.

Yixuan Mei, Zikun Li, Zixuan Chen +8

Distributed Systems & Hardware Inference & Quantization

Chun Yin Chiu·2w ago

Revocation-Ready CP-ABE Key Management for Blockchain-Based IoT Data Sharing

Forget trusted online policy enforcement points: this revocation-ready key management layer uses ciphertext key publication to enforce dynamic, multi-user authorization for releasing or using bulk-data decryption keys in blockchain-based IoT data sharing systems.

Chun Yin Chiu

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Erfan Iravani +6·2w ago

LIPPEN: A Lightweight In-Place Pointer Encryption Architecture for Pointer Integrity

Get strong pointer integrity and confidentiality without metadata overhead: LIPPEN encrypts pointers in-place, turning every pointer into a cryptographically protected block.

Erfan Iravani, Lalit Prasad Peri, Mohannad Ismail +4

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Melki Bino·2w ago

Probabilistic-bit Guided CDCL for SAT Solving using Ising Consensus Assumptions

Stochastic sampling from p-bit Ising models can slash the search effort of CDCL SAT solvers by over 80% on certain problem instances.

Melki Bino

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

BRAC University·2w ago

HELO Cryptography: A Lightweight Cryptographic System for Enhancing IoT Security in P2P Data Transmission

A new cryptographic system promises top-level security for IoT gadgets without sacrificing performance, a rare win for resource-constrained devices.

Tahsin Ahmed, Arjita Saha, Arian Nuhan +3

Distributed Systems & Hardware Inference & Quantization

Elisa Bertino +5·2w ago

Quantum-Resistant Networks: A Review of Primitives, Protocols and Best Practices

The transition to post-quantum cryptography isn't just about swapping algorithms; it demands a complete architectural rethink of networked systems, especially regarding key distribution and management.

Elisa Bertino, R. Kompella, Ashish Kundu +3

Distributed Systems & Hardware

Chris S. Lin +6·2w ago

GPUBreach: Privilege Escalation Attacks on GPUs using Rowhammer

Rowhammer attacks aren't just for CPUs anymore: a malicious CUDA kernel can now leverage targeted bit flips to achieve root access on a system, even bypassing IOMMU protections.

Chris S. Lin, Yuqin Yan, Guozhen Ding +4

Distributed Systems & Hardware Red-Teaming & Adversarial Robustness

Pierre Pouliquen +4·2w ago

Firmware Distribution as Attack Surface: A Security Study of ASIC Cryptocurrency Miners

Publicly available firmware for ASIC cryptocurrency miners is riddled with vulnerabilities, making the distribution mechanism itself a primary attack surface.

Pierre Pouliquen, Hadrien Barral, D. Naccache +2

Distributed Systems & Hardware Red-Teaming & Adversarial Robustness

Takahiro Ishikawa-Aso +4·2w ago

ipc_shared_ptr: A Publish/Subscribe-Aware Smart Pointer for Cross-Process Object Lifetime Management

Achieve a 2.9x reduction in end-to-end latency in ROS 2 communication by trading off scalability for simplicity in cross-process object lifetime management.

Takahiro Ishikawa-Aso, Atsushi Yano, K. Imai +2

Distributed Systems & Hardware Robotics & Embodied AI

James Yen +7·2w ago

Jiao: Bridging Isolation and Customization in Mixed Criticality Robotics

Achieve a near order-of-magnitude reduction in tail timing error in mixed-criticality robotics by decoupling safety-critical control from user applications.

James Yen, Zhibai Huang, Zhixiang Wei +5

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Robotics & Embodied AI

Sushrut Kumar +4·2w ago

GPU-Accelerated Simulations of Problems with Moving Boundaries and Fluid-Structure Interaction at Extreme Scales

Simulating complex fluid dynamics with moving boundaries just got 20x faster thanks to a new GPU-optimized immersed boundary method.

Sushrut Kumar, Joshua Romero, Jung-Hee Seo +2

Distributed Systems & Hardware Scientific Discovery & Drug Design

Reza Farahani +6·2w ago

ClusterLess: Deadline-Aware Serverless Workflow Orchestration on Federated Edge Clusters

ClusterLess slashes workflow completion times by up to 40% and nearly doubles deadline satisfaction in federated edge environments, outperforming existing methods.

Reza Farahani, M. Colosi, Ilir Murturi +4

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

OpenAI·2w ago

Resilient AI Supercomputer Networking using MRC and SRv6

AI training jobs can now shrug off network failures that used to halt progress, thanks to a new resilient networking stack deployed at OpenAI and Microsoft.

Joao Araujo, Alex Chow, Mark Handley +150

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

2w ago·also TU Wien

Orchestrating Serverless Applications in the Edge Cloud Space Continuum: What Breaks and What is Next?

Serverless orchestration falls apart when you move it to space, but this paper proposes a new architecture to fix it.

H. Malazi, Reza Farahani, Nitinder Mohan +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Stefan Fischer +2·2w ago

phys-MCP: A Control Plane for Heterogeneous Physical Neural Networks

Control heterogeneous physical neural networks—even wetware—with a single orchestration architecture, opening the door to seamless integration with edge-cloud workflows.

Stefan Fischer, Malihe Hariri, Sebastian Otte

Distributed Systems & Hardware Robotics & Embodied AI Tool Use & Agents

Nick Brown +1·2w ago

Lifting to tensors when compiling scientific computing workloads for AI Engines

Get up to a 40% performance boost and 15% energy savings on scientific computing kernels by offloading OpenMP loops to AMD's AI Engines with minimal code changes.

Nick Brown, Gabriel Rodriguez-Canal

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Scientific Discovery & Drug Design

Andrzej Lingas·2w ago

On Solving Problems of Substantially Super-linear Complexity in $N^{o(1)}$ Rounds in the MPC Model

Sub-logarithmic MPC protocols for super-linear problems are fundamentally limited: you can't cheat time complexity without paying a steep price in local computation.

Andrzej Lingas

Distributed Systems & Hardware

Aaron Jarmusch +1·2w ago

Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures

Forget simplistic roofline models: these analytical models nail GPU performance prediction on Blackwell and CDNA3 with under 1.5% error.

Aaron Jarmusch, Sunita Chandrasekaran

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

2w ago

Implementing True MPI Sessions and Evaluating MPI Initialization Scalability

Ditching the global MPI_COMM_WORLD communicator unlocks significant scalability gains for MPI applications on exascale systems.

Hui Zhou, Kenneth Raffenetti, Yanfei Guo +2

Distributed Systems & Hardware

Mike Mwanje +3·2w ago

Surviving the Edge: Federated Learning under Networking and Resource Constraints

Standard federated learning deployments can catastrophically fail with just 5-second latency or 50% packet loss, revealing a fundamental mismatch between FL's communication patterns and default TCP configurations.

Mike Mwanje, Okemawo Obadofin, Theophilus A. Benson +1

Distributed Systems & Hardware Inference & Quantization Training Efficiency & Optimization

Dragana Grbic·2w ago

Enhancing Performance Insight at Scale: A Heterogeneous Framework for Exascale Diagnostics

Analyzing exascale performance bottlenecks just got hundreds of times faster, thanks to a new GPU-accelerated framework that pinpoints congestion and predicts optimization opportunities in scientific workloads.

Dragana Grbic

Distributed Systems & Hardware Training Efficiency & Optimization

H. Sedghani +3·2w ago

Decentralized Edge Caching under Budget and Storage Constraints: A Game-Theoretic Approach

Storage scarcity in edge caching doesn't just impact performance, it fundamentally shifts the economic landscape, amplifying inequality among content providers.

H. Sedghani, Zahra Seyedi, Mauro Passacantando +1

Distributed Systems & Hardware Recommendation & Information Retrieval

2w ago

SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison

Forget running the full gauntlet: just 4-5 workloads from SPEC CPU2026 can accurately mirror the entire suite, slashing evaluation costs without sacrificing fidelity.

Ruihao Li, A. Jacob, N. Yadwadkar +1

Distributed Systems & Hardware Eval Frameworks & Benchmarks

NVIDIA·2w ago

The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance

Forget assuming NaNs and single-bit flips are the main culprits in GPU silent data corruption; this study reveals they're surprisingly rare, demanding a rethink of fault modeling.

Chung-Hsuan Tung, Yanxiang Huang, N. Saxena +5

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

Pranav Srinivasan +2·2w ago

täkōFormal: Enabling Robust Software for Programmable Memory Hierarchies (Extended Version)

Formal reasoning about programmable memory hierarchies is now possible, thanks to a new ISA-level memory consistency model that tames the complexity of architectures like täkō.

Pranav Srinivasan, Manos Kapritsos, Yatin A. Manerkar

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Distributed Systems & Hardware

Xu Zhao +7·2w ago

Design and Implementation of BNN-Based Object Detection on FPGA

Achieve near-identical object detection results compared to the ONNX model while drastically reducing computational cost by implementing a binarized YOLOv3-tiny on a low-cost FPGA.

Xu Zhao, Yunpeng Wu, Mengyuan Zhu +5

Computer Vision Distributed Systems & Hardware Inference & Quantization

Ahmed F. Ibrahim·2w ago

A Multi-Agent Consensus Protocol for Stable Software Remodularization

Guaranteeing software stability during remodularization doesn't require sacrificing performance; a multi-agent consensus protocol can match state-of-the-art optimizers while acting as a "circuit breaker" for strict stability constraints.

Ahmed F. Ibrahim

Code Generation & Program Synthesis Distributed Systems & Hardware Tool Use & Agents

May 4, 2026

2w ago·also USC

(POSTER) From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications

Slash sensor application development time from weeks to days by leveraging AI-assisted pattern reuse for intent-driven workflow design.

Komal Thareja, Anirban Mandal, Ewa Deelman

Distributed Systems & Hardware Scientific Discovery & Drug Design Tool Use & Agents

Qipeng Wang +1·2w ago

Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference

LLM serving can get a 34% boost in end-to-end SLO attainment by intelligently scheduling prefill and decode requests based on urgency and slack.

Qipeng Wang, Zhendong Yang

Distributed Systems & Hardware Inference & Quantization

2w ago·also USC

From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications

Rapidly prototype sensor-driven applications across diverse infrastructures without needing cross-domain expertise using AI-assisted, pattern-based workflow engineering.

Komal Thareja, Anirban Mandal, Ewa Deelman

Distributed Systems & Hardware Robotics & Embodied AI Scientific Discovery & Drug Design

Jenny Lynn Almerol +3·2w ago·also Studi Avanzati (SISSA)

Assessing Performance and Porting Strategies for Gravitational $N$-Body Simulations on the RISC-V-Based Tenstorrent Wormhole™

RISC-V accelerators, originally for AI, can efficiently run scientific simulations, but only with the right parallelization strategy.

Jenny Lynn Almerol, Elisabetta Boella, Mario Spera +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Scientific Discovery & Drug Design

2w ago·also University of Jyväskylä

Distributed Quantum Circuit Optimisation: Evaluating Global and Local encodings

Quantum circuit optimization doesn't always improve distributed execution: sometimes, local optimization surprisingly beats global methods at minimizing communication costs.

Maria Gragera Garces, Majid Haghparast

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Training Efficiency & Optimization

2w ago·also HSE University, IDEAS: Inter-Disciplinary & Advanced, Mid Hope Technologies, Moscow Institute of Physics and Technology +2

Caliper-in-the-Loop: Black-Box Optimization for Hyperledger Fabric Performance Tuning

Bayesian optimization can automatically tune Hyperledger Fabric configurations to achieve double-digit throughput improvements, but the impact of measurement noise on interpreting gains cannot be ignored.

Yash Madhwal, Arseny Bolotnikov, Mark Prikhno +5

Distributed Systems & Hardware Training Efficiency & Optimization

Hongbin Zhang +5·2w ago

PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers

Commodity GPU servers can achieve surprisingly high LLM inference throughput by cleverly orchestrating pipeline parallelism with KV cache offloading.

Hongbin Zhang, Taosheng Wei, Jiazhi Jiang +3

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization

Ahmad Dabaja +1·2w ago

FedPLT: Scalable, Resource-Efficient, and Heterogeneity-Aware Federated Learning via Partial Layer Training

FedPLT achieves full-model accuracy in federated learning while training up to 82% fewer parameters per client, slashing communication costs and enabling participation from resource-constrained devices.

Ahmad Dabaja, Rachid El-Azouzi

Distributed Systems & Hardware Training Efficiency & Optimization

Mohammadreza Doostmohammadian +1·2w ago

Distributed Observer-based Fault Detection over Intelligent Networked Multi-Vehicle Systems

CAVs can now detect sensor anomalies in their measurements without relying on a central unit, even when tracking human-driven vehicles that aren't directly observable.

Mohammadreza Doostmohammadian, Hamid R. Rabiee

Distributed Systems & Hardware Robotics & Embodied AI

S. Catalán +2·2w ago

Leveraging Teaching on Demand: Approaching HPC to Undergrads

Hands-on experience with Raspberry Pi clusters and student-driven learning can effectively bridge the HPC skills gap in undergraduate engineering education.

S. Catalán, R. Carratalá-Sáez, S. Iserte

Code Generation & Program Synthesis Distributed Systems & Hardware

Georg-August-Universität Göttingen·2w ago

A Treasure Trove of Performance: Analyzing the IO500 Submission Data

HPC storage benchmarks hide a wealth of insights into filesystem-specific overheads and load imbalances, if you're willing to dig into the logs.

Julian Kunkel, Aasish Kumar Sharma, Anila Ghazanfar +2

Distributed Systems & Hardware Eval Frameworks & Benchmarks

2w ago·also Princeton, Rutgers

AAFLOW: Scalable Patterns for Agentic AI Workflows

Agentic workflows can be sped up by 4.6x, not through faster LLMs, but by optimizing data flow and communication between components.

Arup Kumar Sarker, Mills Staylor, Aymen Alsaadi +3

Distributed Systems & Hardware Tool Use & Agents

Yijiang Li +5·2w ago

FedQueue: Queue-Aware Federated Learning for Cross-Facility HPC Training

FedQueue tackles the Achilles' heel of federated learning on HPC clusters (unpredictable queue delays) by explicitly modeling and mitigating their impact, leading to significant speedups.

Yijiang Li, Emon Dey, Zilinghan Li +3

Distributed Systems & Hardware Training Efficiency & Optimization

May 3, 2026

2w ago

On the Distortion of Partitioning Performance by Random Quantum Circuits

Random quantum circuits, a common proxy for real workloads, can mislead the design of distributed quantum computing compilers by distorting hypergraph partitioning performance.

Maria Gragera Garces

Distributed Systems & Hardware Eval Frameworks & Benchmarks

University of Sharjah·2w ago·also Bologna

Decentralized Stratified Sampling for Low-Latency Approximate Geospatial Data Stream Processing in Edge-Cloud Architectures

Offloading geospatial data sampling to the edge slashes latency and bandwidth costs, achieving cloud-competitive accuracy with 80% less data.

Isam Mashhour Al Jawarneh, Lorenzo Felletti, Luca Foschini +1

Data Curation & Synthetic Data Distributed Systems & Hardware

NVIDIA·2w ago·also TAU

nvPAX: Constrained Optimization for Dynamic Power Allocation in Hierarchical and Multi-Tenant Systems

Hierarchical power allocation in datacenters can achieve near-perfect satisfaction ratios, even with oversubscription, by using a novel three-phase QP/LP optimization policy.

Hadar Sivan, Gil Shabat, Yoel Shkolnisky

Distributed Systems & Hardware Training Efficiency & Optimization

2w ago·also Microsoft Research, Forschungszentrum Jülich GmbH, Snowflake

Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips

Optimizing for runtime in multimodal training can be energy-inefficient, as data movement and overlap on Grace Hopper chips dominate energy consumption, not raw compute.

Mahmoud Ahmed, Sameh Abdulah, Olatunji Ruwase +4

Distributed Systems & Hardware Multimodal Models Training Efficiency & Optimization

Yihan Xue +4·2w ago

Joint Temporal-Structural Representation Learning for Distributed Fault Discrimination in Microservice Architectures

Untangling the chaotic web of microservice failures just got easier: a new model uses temporal graph neural networks to pinpoint faults by jointly learning how services evolve and interact.

Yihan Xue, Yuxiao Wang, Ao Zhu +2

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware

2w ago

SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving

Cut KV-cache transfer times by up to 32% with SplitZip, a new GPU-friendly lossless compressor that unlocks faster disaggregated LLM serving.

Yipin Guo, Siddharth Joshi

Distributed Systems & Hardware Inference & Quantization

May 1, 2026

Zi-Bo Qin +4·3w ago

Learning to Act and Cooperate for Distributed Black-Box Consensus Optimization

LLMs can now intelligently orchestrate multi-agent systems, learning to optimize both individual agent actions and inter-agent cooperation for distributed black-box problems.

Zi-Bo Qin, Zijian Qin, Feng-Feng Wei +2

Distributed Systems & Hardware Robotics & Embodied AI Tool Use & Agents
