Search papers, labs, and topics across Lattice.
100 papers published across 7 labs.
Even closely related microcontrollers exhibit drastically different SRAM PUF performance under varying temperatures, underscoring the need for careful hardware selection.
Unlock geometric algebra's performance potential in neural networks and spatial computing by compiling directly from multi-way relationships, eliminating manual specialization and ensuring geometric correctness.
Seemingly idle LLM inference fleets can be secretly broken, and this simulator helps you find out why before you buy.
Multi-party function secret sharing just got a whole lot more practical: a new DDH-based scheme slashes key sizes by up to 10x.
LLM GPU fleets can be analytically optimized into a two-pool architecture with gateway-layer compression, slashing costs by up to 82% without sacrificing latency.
Pre-trained models unlock surprisingly aggressive quantization in federated learning, slashing communication costs by 40% without sacrificing accuracy on MNIST and CIFAR-100.
Sedna, a promising consensus protocol, is surprisingly vulnerable to cartel attacks that can stall block production and extract MEV, but a clever bounty mechanism can restore its security.
Network coding, often overlooked in robotics, can drastically improve the reliability and timeliness of multi-robot communication, outperforming traditional retransmission methods in safety-critical scenarios.
Quantum computers could finally unlock the full potential of machine learning for drug discovery by directly generating the quantum chemistry data that classical computers struggle to produce.
Federated recommendation systems can now better adapt to evolving user preferences without sacrificing privacy, thanks to a novel approach that retains historical knowledge and transfers insights between similar users.
YouTube's platform defenses are a house of cards: circumventing one control often triggers a cascade of failures, demanding constant architectural adaptation for large-scale content replication.
Ergodic control lets swarms of robots cooperatively manufacture micro-patterned surfaces, unlocking scalable production of materials with enhanced physical properties.
Forget buying new GPUs – clever context-length routing can boost your LLM inference energy efficiency by 2.5x, dwarfing the 1.7x gain from upgrading to a B200.
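The routing idea above can be sketched in a few lines: send each request to the most energy-efficient pool whose context window fits it. The pool names, window sizes, and joules-per-token figures below are purely illustrative assumptions, not numbers from the paper.

```python
def route(prompt_tokens, pools):
    """Pick the most energy-efficient pool whose context window fits
    the request; fall back to the largest window otherwise.
    `pools` maps a (hypothetical) pool name to (max_context, joules_per_token).
    """
    fitting = {name: p for name, p in pools.items() if prompt_tokens <= p[0]}
    if not fitting:
        # Nothing fits: send to the pool with the largest window.
        return max(pools, key=lambda name: pools[name][0])
    # Among fitting pools, choose the cheapest per token.
    return min(fitting, key=lambda name: fitting[name][1])

pools = {
    "short-ctx-pool": (4096, 0.4),     # small window, cheap per token
    "long-ctx-pool": (131072, 1.0),    # big window, expensive per token
}
print(route(1000, pools))    # -> short-ctx-pool
print(route(50000, pools))   # -> long-ctx-pool
```

The energy win comes from short requests never occupying hardware provisioned for long contexts.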
Automatically tracking causality across actors exposes hidden behavioral violations in real-world Erlang systems, without requiring manual code modifications.
NNVMC's promise for solving quantum many-body problems is currently bottlenecked by surprisingly mundane issues: low-intensity elementwise operations and data movement on GPUs.
Achieve up to 2.4x speedup over OpenBLAS on RISC-V by using MLIR and xDSL to generate optimized RVV code, finally unlocking the potential of RISC-V vector extensions.
Forget painstakingly tuning quantization for each LLM – RAMP learns a quantization policy that generalizes across architectures, often outperforming target-specific training.
Lossless compression can actually *speed up* LLM inference on GPUs, not just shrink model size, thanks to ZipServ's hardware-aware design.
Forget centralized control: this algorithm lets swarms of robots build complex shapes with only local communication and no global positioning.
Achieve significant latency and energy savings in memory systems with an RL-based controller that also provides insights into *why* its decisions are optimal.
Ditch backprop's limitations: this synthesizable RTL implementation brings predictive coding networks to life in fully distributed hardware.
Running robotic manipulation workloads entirely onboard kills robot batteries, but offloading to the cloud tanks accuracy due to network latency, revealing a critical compute placement trade-off.
SpiderCam shatters power consumption barriers for FPGA-based 3D cameras, achieving sub-Watt operation while maintaining real-time performance.
By federating distributional critics and using a Wasserstein barycenter trust region, TR-FedDistRL avoids the dangerous "mean-smearing" that can make federated RL unsafe in critical applications.
Independent sampling of graph partitions is now a practical alternative to MCMC, offering a new path for generating diverse redistricting plans.
Secure enclave updates and migrations, previously missing from RISC-V TEEs, are now practical thanks to a novel toolkit that adds minimal overhead.
Finally, a software energy profiler achieves both high accuracy and cross-platform portability, enabling practical algorithmic energy optimization across diverse languages and hardware.
Ditch the polar decomposition: MUD offers a surprisingly simple and efficient alternative for momentum whitening, speeding up transformer training by up to 50% compared to AdamW and Muon.
Even without architectural modifications, a new gradient inversion attack, ARES, can reconstruct high-fidelity training samples in federated learning, exposing a significant privacy risk.
Reproducibility in hardware reverse engineering is shockingly low, with only 4% of evaluated artifacts from 187 papers yielding reproducible results.
Federated Computing as Code lets you enforce data sovereignty in federated systems with cryptographic guarantees, moving beyond runtime policies and trust assumptions.
LLM serving systems can boost Time-To-First-Token (TTFT) attainment by up to 2.4x simply by prioritizing network flows based on a novel approximation of Least-Laxity-First scheduling.
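Least-Laxity-First prioritizes the flow with the least slack before its deadline is missed: laxity = deadline − now − remaining service time. A minimal sketch of that ordering rule, with made-up field names and timings (the paper's actual approximation is not reproduced here):

```python
import heapq

def llf_order(flows, now):
    """Order flow indices by laxity (slack before the TTFT deadline
    is blown). Smaller laxity = more urgent = scheduled first.
    Each flow is a dict with hypothetical fields:
      deadline  - absolute TTFT deadline (seconds)
      remaining - estimated remaining transfer time (seconds)
    """
    heap = [(f["deadline"] - now - f["remaining"], i)
            for i, f in enumerate(flows)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, i = heapq.heappop(heap)
        order.append(i)
    return order

flows = [
    {"deadline": 1.0, "remaining": 0.2},   # laxity 0.8
    {"deadline": 0.5, "remaining": 0.4},   # laxity 0.1 -> most urgent
    {"deadline": 2.0, "remaining": 0.1},   # laxity 1.9
]
print(llf_order(flows, now=0.0))  # -> [1, 0, 2]
```

Note the flow with the earliest deadline is not automatically first; a late-deadline flow with a long remaining transfer can still be more urgent.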
Forget slow, single-SSD paging: Swarm unlocks 2.7x higher bandwidth for LLM KV-cache offloading by exploiting stable co-activation patterns to parallelize I/O across multiple SSDs.
ROS 2's real-time performance gets a major boost with ReDAG-RT, a user-space scheduler that cuts deadline misses by up to 30% without touching the core ROS 2 API.
Biased compression, previously overlooked in distributed learning with gradient coding, can actually boost performance when combined with error feedback to mitigate straggler effects and reduce communication costs.
Forget wrestling with 5G/6G testbeds – Plaza6G lets you design and run wireless experiments with natural language, thanks to an LLM-powered assistant.
Fine-tune 123B+ parameter models on a single RTX 4090 with SlideFormer, a system that supports models up to 6x larger and batch sizes up to 8x larger.
Achieve sub-microsecond decoding-feedback latency in a scalable, open-source QEC system, bringing fault-tolerant quantum computation closer to reality.
Achieve near-linear scaling and 40x speedup for MP2 calculations on large molecules by unleashing multi-GPU parallelism for local correlation methods.
Visual SLAM loop closure just got a whole lot faster: FastLoop achieves up to 3x speedups by unleashing the power of GPU parallelism.
An existing debugging tool, the Arm Embedded Trace Macrocell (ETM), can be surprisingly repurposed to create a portable and effective hardware-assisted memory bandwidth regulator.
Resource-consumption vulnerabilities in LLMs can degrade service availability and undermine economic sustainability, demanding a systematic approach to understanding and mitigating them.
A novel DRL approach can extend XR device battery life by 163% without sacrificing real-time responsiveness, offering a practical solution to the energy-latency trade-off in immersive applications.
Forget stiff, piecewise designs: this soft robot arm achieves 4x faster dynamic task execution than previous approaches, proving that high-performance control and full compliance *can* coexist.
GitOps can transform CTF management, enabling automated deployments, enhanced collaboration, and cost-effective scaling.
A serverless, peer-to-peer messaging system achieves end-to-end encryption and data minimization, demonstrating a practical alternative to centralized messaging platforms.
Hooking the filesystem-specific `xfs_file_open` callback in ROFBS can significantly reduce ransomware damage on XFS filesystems, outperforming other generic file-open hooks.
A novel MARL algorithm, DS-PPO, enables multi-satellite systems to maximize user sum-rate despite outdated channel state information, offering a practical solution for robust global connectivity.
Forget hand-tuned defenses: a meta-learned aggregation strategy dynamically shields federated learning from a wide range of Byzantine attacks, even ones it's never seen before.
Forget relying on pretrained models or complex aggregation schemes: FederatedFactory achieves near-centralized performance in federated learning with extreme data heterogeneity by simply swapping generative priors.
A pragma-based OpenACC acceleration strategy delivers a 5x speedup and 3x energy reduction for the ECsim Particle-In-Cell code, proving its readiness for exascale plasma simulations.
Resource-constrained Arabic AI development can compete with systems built at far greater scale, as demonstrated by Fanar 2.0's performance gains using 8x fewer pre-training tokens than its predecessor.
Achieve energy-consistent parallel simulations of robotic systems with provable passivity guarantees, even with limited computational resources, by using a novel iterative coupling scheme.
Stop wasting precious GPU memory: this new cache-semantic hash table library achieves up to 3.9 billion key-value lookups per second, outperforming standard approaches by up to 9.4x.
A new 32B code LLM trained specifically for industrial tasks crushes existing models on specialized domains like chip design and GPU kernel optimization, while remaining competitive on general coding benchmarks.
Forget kinematic tree approximations: Kamino unlocks high-fidelity, massively parallel robot simulations with closed kinematic chains directly on GPUs.
Enterprises can regain control over network access in the age of MAC address randomization using a RADIUS-based framework that maintains persistent device identity without OS modifications.
A simple orthogonal rotation of the activation space makes LLMs virtually immune to bit-flip attacks, even against targeted single-point faults.
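The intuition behind the rotation defense: multiplying the weights by a random orthogonal matrix (and undoing it on the activations) leaves the layer's function unchanged, but diffuses any "critical" weight across an entire row, so no single stored value is worth flipping. A minimal NumPy sketch under that reading (dimensions and the planted outlier are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.standard_normal((d, d))
W[0, 0] = 100.0                     # planted "critical" outlier weight
x = rng.standard_normal(d)

# Random orthogonal matrix via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Store rotated weights; rotating the incoming activations undoes it,
# so the layer computes exactly the same output.
W_rot = W @ Q
assert np.allclose(W @ x, W_rot @ (Q.T @ x))

# The outlier is now spread over a whole row of W_rot: no single
# stored value dominates, so a targeted single-bit flip has far
# less leverage than flipping W[0, 0] directly.
print(np.abs(W).max(), np.abs(W_rot).max())
```

In a real deployment the rotation would be folded into adjacent layers so it costs nothing at inference time; this sketch only shows the equivalence and the outlier diffusion.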
Always-on hardware Trojans leave persistent statistical signatures in EM emissions that can be detected without a golden reference, even differentiating between workload-correlated and independent Trojans.
IC verification just got a whole lot easier: SAMSEM can segment metal lines in SEM images with surprisingly low error rates, even on unseen ICs.
Vectorizing Verilog designs slashes memory consumption by over 50% in formal verification, even without changing the underlying hardware.
UAV swarms can achieve near-optimal cooperative deployment and generalize to new team sizes using a communication-aware MARL approach, even with limited communication and partial observability.
Parallelizing sequential computations like RNNs is now more feasible thanks to new scalable and stable parallel Newton methods, along with a theoretical understanding of when such parallelization provably accelerates computation.
Blindly applying GPU optimizations to homomorphic encryption can leave nearly 2x performance on the table, as the best strategy hinges on CKKS parameters and GPU architecture.
Replay-driven validation slashes CPU-GPU integration time in chiplet architectures, enabling full system boot and workload execution in a single quarter.
Binary neural networks can now be trained effectively in federated settings, offering a path to low-cost, privacy-preserving edge inference without sacrificing accuracy.
Inference time can reveal the GPU models behind black-box LLM APIs, offering a way to estimate their hidden energy costs.
Sampling the wrong data in differentially private queries can inflate error by 10x, but a new method slashes that overhead by sampling aggregation units instead of users.
Now you can predict the structure of biomolecular assemblies exceeding 30,000 residues, thanks to a new context parallelism framework that shatters previous memory constraints.
Federated reinforcement learning can now handle heterogeneous, adversarial IoT environments with near-zero deadline violations, thanks to a novel decentralized framework that transfers knowledge across silos.
Worried about compromised cloud environments skewing your endpoint auditing? vCause offers a verifiable causality analysis system with negligible overhead.
Forget complex combinators: a simple multiplication trick can slash LLM latency by 92% and boost throughput by 21%, outperforming production schedulers.
Overcome resource constraints in federated learning by enabling clients to train spiking neural networks of varying sizes and aggregate their knowledge effectively.
Achieve faster, Byzantine-robust distributed learning by combining double momentum with variance reduction, eliminating the need for large batch sizes.
Achieve up to 50% energy savings and 80% latency reduction in edge-based object detection by intelligently balancing load across heterogeneous devices, at the cost of only a minor accuracy trade-off.
For spacecraft-bound neural networks, a new bit-serial matrix multiplication accelerator, bitSMM, delivers impressive GOPS/W on both FPGA and ASIC, promising efficient on-board inference.
Achieve near-ideal GPU sharing without kernel hacks: DetShare guarantees semantic and performance determinism through GPU coroutines and lightweight context migration.
Cuckoo filters on GPUs can now achieve performance rivaling append-only Bloom filters, thanks to a novel lock-free architecture and memory access optimization strategy that closes the gap between static and dynamic approximate membership query structures.
Multi-agent LLM systems can slash synchronization costs by up to 95% by borrowing cache coherence strategies from chip design.
LLMs can run up to 35% faster on chiplet architectures thanks to a new lossless exponent compression technique that slashes inter-chiplet communication overhead.
Interpretable machine learning unlocks holistic, data-driven design of SSDs, enabling continuous architectural advancements across memory generations.
LLMs can now scale depth more effectively: a new attention mechanism recovers diluted features in deeper layers, boosting performance with negligible overhead.
Exact sampling in large-vocabulary decoding can be sped up by 19% simply by fusing it into the LM-head matmul, turning a bandwidth bottleneck into a lightweight epilogue.
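One plausible mechanism behind such a fusion is the Gumbel-max trick: an exact softmax sample equals argmax(logits + Gumbel noise), and an argmax can be maintained as a running max while the LM-head matmul streams through vocabulary tiles, so the full logit vector never has to be written back to memory. A NumPy sketch of that equivalence (tile size and shapes are illustrative assumptions, not the paper's kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, tile = 1000, 64, 128          # vocab size, hidden dim, tile width

h = rng.standard_normal(d)          # final hidden state
W = rng.standard_normal((V, d))     # LM-head weight matrix
g = rng.gumbel(size=V)              # Gumbel noise, drawn up front

# Reference: materialize all logits, then take one exact sample via
# Gumbel-max (argmax(logits + g) is distributed as softmax(logits)).
ref = int(np.argmax(W @ h + g))

# "Fused" version: sweep the vocab in tiles, keeping only a running
# max, so the logits live only tile-by-tile in fast memory.
best, best_idx = -np.inf, -1
for start in range(0, V, tile):
    scores = W[start:start + tile] @ h + g[start:start + tile]
    j = int(np.argmax(scores))
    if scores[j] > best:
        best, best_idx = scores[j], start + j

assert best_idx == ref
```

The sample is exact, not approximate: both paths compute the same argmax over the same perturbed logits.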
Database tuning just got easier: DOT dynamically identifies and optimizes key parameters on-the-fly, outperforming existing methods without the need for costly warm-up phases.
Forget exotic attention mechanisms – MobileLLM-Flash achieves up to 1.8x faster LLM prefill on mobile CPUs by smartly pruning and adapting existing architectures for on-device use.
Squeezing federated learning through bandwidth-constrained networks? This routing and pruning method boosts accuracy by 12% while slashing latency by 28%.
MONET reveals the potential for significant hardware architecture improvements by modeling and optimizing neural network training, a domain often overshadowed by inference-centric design.
SALT offers a surprisingly effective way to personalize and harden split computing models in closed environments, using a lightweight adapter that outperforms full fine-tuning while slashing training costs.
Cykas lets long-running distributed jobs start and end sooner by cleverly shifting causal delivery enforcement from senders to receivers.
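Receiver-side causal delivery is a classic pattern: messages carry the sender's vector clock, and the receiver buffers out-of-order arrivals until each message's causal predecessors have been delivered, so senders never have to block. A minimal sketch of the standard delivery condition (this illustrates the general technique, not Cykas's specific protocol):

```python
class CausalReceiver:
    """Buffers incoming messages and delivers them in causal order.
    Each message carries the sender's vector clock at send time."""

    def __init__(self, n_nodes):
        self.clock = [0] * n_nodes   # count of delivered msgs per sender
        self.buffer = []
        self.delivered = []

    def _deliverable(self, sender, vc):
        # Next-in-sequence from this sender, and every message it
        # causally depends on from other senders is already delivered.
        return vc[sender] == self.clock[sender] + 1 and all(
            vc[k] <= self.clock[k] for k in range(len(vc)) if k != sender)

    def receive(self, sender, vc, msg):
        self.buffer.append((sender, vc, msg))
        progress = True
        while progress:                 # drain everything now unblocked
            progress = False
            for item in list(self.buffer):
                s, v, m = item
                if self._deliverable(s, v):
                    self.buffer.remove(item)
                    self.clock[s] += 1
                    self.delivered.append(m)
                    progress = True

r = CausalReceiver(2)
# Message "b" from node 0 causally follows "a" but arrives first.
r.receive(0, [2, 0], "b")
r.receive(0, [1, 0], "a")
print(r.delivered)  # -> ['a', 'b']
```

Delivery order respects causality even though the network reordered the messages, and the sender paid nothing for the guarantee.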
FPGAs can beat GPUs at dynamically allocating computation for LLM inference, thanks to a new architecture that fuses operations, uses mixed precision, and caches KV values on-chip.
Neuromorphic systems can achieve deterministic computation despite temporal stochasticity by enforcing charge conservation, enabling a direct mapping to quantized ANNs.
Hybrid Mamba-Transformer models can get 4x faster time to first token and 1.4x higher throughput by disaggregating prefill and decode phases onto specialized accelerator packages.
Twin-field QKD slashes the infrastructure complexity of quantum-secured blockchains from quadratic to linear scaling, paving the way for practical, long-distance deployments.
Domain skew in federated learning can be tamed by decoupling and calibrating domain-specific features, leading to more consistent and generalizable global models.
Oblivis enables practical, privacy-preserving database queries in cloud and edge settings, achieving up to 10^6x speedups over standard Oblivious Transfer methods.
Stop wasting compute: Sharing KV caches across tasks and time can make Vision-Language-Action models run 3.7x faster.
CacheLib, a popular caching engine, buckles under dynamic multi-tenant workloads, revealing critical limitations in adaptability and fairness that demand a rethink of its design.
Rule-based electromigration checks are no longer sufficient; physics-based models are ready for prime time, but several open problems must be solved to enable their practical adoption in integrated circuit design.
Optimizing committee configurations with mixed integer programming can boost transaction throughput in trusted parallel BFT systems by up to 21%, outperforming randomized assignment.
Achieve near-optimal power-efficient deep learning inference on edge devices without the need for expensive and repeated offline profiling, thanks to a novel online optimization method.