Search papers, labs, and topics across Lattice.
72 papers published from 1 lab.
Pythonistas rejoice: aggregate programming, a powerful paradigm for distributed systems, finally gets a first-class, easy-to-use library in your favorite language.
Automating detector design with AI can dramatically speed up scientific discovery by intelligently exploring complex parameter spaces.
LLMs' skewed matrix shapes need not hamstring systolic array performance: SISA's partitioned architecture achieves up to 8.52x speedup and 93% EDP reduction compared to monolithic arrays.
FL systems are far more vulnerable to backdoor attacks using realistic, semantically aligned triggers (like sunglasses) than evaluations with simple corner-patch triggers had suggested.
Commodity CPUs can be retrofitted with hardware-backed control flow attestation using hardware performance counters, enabling runtime attack detection in TEEs.
Now, clients can actually *verify* that their data has been removed from a federated learning model, even when the server is untrusted.
Achieve structured IPC and practical message movement in modular services with CNS, a lightweight hybrid event fabric that bridges in-process and inter-node communication with minimal overhead.
Stop averaging prototypes blindly: FedDBP uses Fisher information to intelligently fuse local prototypes, significantly boosting performance in heterogeneous federated learning.
Guaranteeing safety in multi-agent systems with dynamic networks doesn't have to sacrifice performance: this plug-and-play protocol ensures recoverable safety even when agents join/leave or network topologies shift.
Achieve HPC acceleration by emulating FP64 operations with INT8 precision on GPUs, proving that you can boost performance *and* accuracy.
Quantum circuit compilation, a major bottleneck, can be sped up by over 15x with minimal overhead using a new parallelization technique validated on 8000 large-scale, configurable random circuits.
Datacenter simulations can now combine multiple independent models to better predict performance and climate impact, addressing limitations of single-model approaches.
Unexplained P99.9 latency spikes in Apache Pulsar could be due to a previously undocumented Linux kernel page cache writeback interaction inside BookKeeper's ForceWriteThread, even with dedicated NVMe drives.
Sometimes, knowing less (limiting computation to polynomial time) can let you decide *more* in distributed systems, especially with universal certificates.
Dataflow networks can achieve significant energy savings without sacrificing throughput by strategically powering down actors during idle periods, a balance efficiently discovered using a novel "Hop and Skip" exploration strategy.
Pinpointing performance bottlenecks in large-scale AI training just got 100x faster, thanks to a new system that watches the whole stack without slowing things down.
Finally, a gem5-integrated simulator that accurately models CXL memory expansion for LLMs, capturing real-world effects like cache pollution.
Achieve up to 4.17x speedup in DRL training by intelligently partitioning tasks across CPUs, FPGAs, and AI Engines on AMD Versal ACAP, demonstrating the power of hardware-aware algorithm design.
Unlock 600,000x faster TSV design by replacing computationally expensive full-wave simulations with physics-informed graph neural networks.
Calculating excited states of molecules with thousands of atoms, previously a computational bottleneck, is now practical on a single GPU thanks to a new implementation of TDDFT-risp.
Optimized LoRaWAN gateway placement hinges on the channel model used, with ray tracing offering higher fidelity but at a significant computational cost.
Unbalanced class prevalence, not just disjoint label sets, is the dominant factor hindering federated learning performance under label-space heterogeneity.
Compromised 5G networks can be weaponized with chained, undetectable command and control channels, enabling attacks that bypass existing security measures.
Second-order federated learning can be made robust and practical: FedRCO overcomes instability issues and outperforms first-order methods in non-IID settings.
Differentiable Power-Flow unlocks scalable, gradient-based optimization for power grid management, outperforming traditional methods and enabling new applications like real-time contingency analysis.
Federated learning can overcome data sparsity and privacy concerns to improve livestock growth prediction using real-world farm data.
Agentic RL rollouts are bottlenecked by long-tail trajectory generation, but Heddle's trajectory-centric approach achieves 2.5x higher throughput.
FedDES achieves instance-level personalization in federated learning by dynamically selecting and weighting peer models with a GNN, leading to significant performance gains in heterogeneous environments.
Guaranteeing robust distributed GenAI inference at the edge requires trust-aware routing, and G-TRAC achieves this with sub-millisecond routing latency.
Quantum-proofing your 5G core doesn't have to break the bank: a sidecar proxy can add post-quantum cryptography with a predictable 50ms latency hit.
Lightweight DisCNNs offer a surprisingly efficient route to object detection by exploiting monotonic relationships between network outputs and feature presence.
RDMA failover can be made significantly more efficient and correct by selectively retransmitting only the requests that were actually lost during a link failure, avoiding redundant retransmissions and semantic violations.
Squeezing loop control down to <10% of array resources unlocks near-zero-overhead parallel loop acceleration on Tightly Coupled Processor Arrays.
Forget CPUs and GPUs: MCPT-Solver uses spintronics and Bayesian inference to create a hardware random number generator that dramatically accelerates Monte Carlo particle transport simulations.
LLMs can now automatically evolve and optimize GPU kernels, beating both hand-tuned kernels and those produced by proprietary models like Gemini and Claude.
By cleverly repurposing an unused sign bit, IF4 achieves superior quantization performance compared to NVFP4 without increasing bit-width.
Forget slow rotations: IsoQuant's quaternion-based approach outpaces RotorQuant in LLM KV cache compression, delivering up to 6x speedups on synthetic data.
Blockchain-based federated learning can be made practical by using multi-task peer prediction to overcome the computational bottleneck of contribution measurement.
Bitcoin can be more than just digital gold: BitSov proposes a composable architecture for a censorship-resistant internet, anchored to Bitcoin's blockchain, that could reshape how we build decentralized applications.
Backdoor defenses can be baked into the pre-training phase of federated learning, achieving state-of-the-art attack mitigation with minimal impact on clean accuracy.
FedBBA slashes backdoor attack success rates to as low as 1.1% in federated learning, leaving existing defenses in the dust.
Achieve secure outsourced decision tree evaluation without any communication between servers, unlocking faster and more scalable MLaaS deployments.
Flow-matching generative models can simultaneously defend against poisoning attacks and preserve privacy in federated learning, outperforming existing methods in accuracy and robustness.
Pinpointing root causes in distributed systems just got easier: Lumos automatically exposes the computational history of bugs with low overhead, even with limited bug occurrences.
Forget hand-coding adapters: this middleware uses LLMs to automatically bridge REST APIs, GraphQL endpoints, and IoT devices with a 90% success rate.
Real-time 3D occupancy mapping for edge devices is now possible under a 6mW power budget thanks to Gleanmer, a novel SoC.
Cloud databases are leaving performance on the table: optimizing kernel-space I/O can yield up to 9x speedups without requiring kernel or database patches.
Securing and accelerating Slurm cluster access is now possible without rewriting existing tools, thanks to a lightweight proxy that adds granular permissions and caching.
LLM inference bottlenecks aren't just compute-bound: heterogeneous GPU-FPGA systems can slash memory processing overheads by up to 2x while simultaneously reducing energy consumption.
Distributed vertex coloring can now be solved in near-optimal $\tilde{O}(\log^4 \log n)$ rounds, closing the gap with the theoretical lower bound and exponentially improving performance for graphs with small maximum degree.
Deploying transformers in real-time just got a whole lot faster: this work achieves up to 64x speedups on GPUs while maintaining accuracy through a novel hybrid precision approach.
Forget fixed memory budgets: dynamically allocating exemplar storage across federated clients boosts performance in class-incremental learning for heterogeneous medical data.
Intra-warp load imbalance, a major bottleneck in GPU-accelerated Electronic Design Automation, can be eliminated through warp-level parallel orchestration, leading to significant speedups in static timing analysis.
Content-oblivious networks can count and simulate message passing far more efficiently than previously thought, shrinking the pulse complexity from $O(n^3)$ to $O(n \log^2 n)$ for counting and $O(b)$ per process for message simulation.
Achieve strong, controllable privacy in federated biomedical AI without sacrificing performance, thanks to a lightweight key-embedded implicit neural representation.
Save time and resources: predict federated learning performance *before* deployment by quantifying dataset and client complexity.
A space-tailored OS improves task completion over Kubernetes by nearly 100%, thanks to smarter resource awareness in fragmented, network-constrained environments.
Differentiable optimization can supercharge classical ILP solvers, slashing runtime by 10x on combinatorial scheduling problems.
Open-source RISC-V microcontrollers are now easier to build, thanks to a streamlined design and fully open RTL-to-GDS flow.
Achieve high-speed, low-latency object detection in autonomous systems by integrating spiking neural networks and dynamic image signal processing on an FPGA.
Training large models without communication overhead is now practical: OptINC uses optical interconnects to perform gradient averaging and quantization directly in the network.
Forget GPU-centric All-Reduce: SCIN's switch-based architecture slashes latency by up to 8.7x and boosts LLaMA-2 performance by 34% through in-network quantization.
Achieve up to 32.1% energy-delay product improvement in high-speed adders by co-optimizing prefix topology and standard cell mapping, outperforming commercial synthesis tools.
Forget relying on centralized trust: a decentralized witnessing-zone architecture can boost sensor data trustworthiness against fabricated events.
Optimizing OpenFOAM with GPU ports and selective-memory techniques slashes energy consumption by 28% and iteration time by 72% compared to purely hardware-focused approaches.
Apple's own vDSP FFT library gets smoked by a new implementation that's 29% faster, thanks to a clever two-tier memory model exploiting the GPU's register file and threadgroup memory.
Ternary LLMs can run up to 62x faster on CPU and 1.9x faster on CUDA with RSR-core, a new engine that finally brings theoretically fast low-bit matrix multiplication to practical hardware.
Switching HPC schedulers mid-lifecycle doesn't have to break everything: a carefully staged transition can dramatically improve queue times and user adoption.
Propagating mega-constellations is now 1500x faster thanks to a JAX-based SGP4 reimplementation, making large-scale collision avoidance tractable.
Current blockchain scalability solutions often fall short of meeting the stringent real-time demands of IoT applications, highlighting the need for adaptive and AI-driven approaches.
Multimodal federated learning can finally handle the messy reality of missing data with BLOSSOM's block-wise personalization, boosting performance by up to 37.7% compared to naive aggregation.
Multi-chiplet architectures can unlock significant speedups and memory savings for low-batch MoE inference by dynamically scheduling expert computations across high-bandwidth die-to-die links.