Distributed Systems & Hardware Infrastructure
Distributed training, model parallelism, AI accelerator design, and large-scale compute infrastructure.
Recent Papers
This paper introduces SMAPPO, a scalable multi-agent reinforcement learning framework for decentralized multi-robot management in multi-machine tending scenarios. SMAPPO employs a novel observation encoder to achieve input-size invariance, enabling it to handle varying numbers of agents, machines, and storage areas without retraining. Experiments demonstrate that SMAPPO outperforms MAPPO in full retraining, curriculum learning, zero-shot generalization, and adaptability under low initial training, showing significant improvements in productivity, collision avoidance, and parts delivery.
Introduces a novel observation encoder for MAPPO that enables zero-shot generalization to variable numbers of agents and machines in multi-agent reinforcement learning.
The paper introduces Differentially Private Perturbed Push-Sum (DPPS), a protocol-level differential privacy mechanism for decentralized communication networks that addresses the challenge of sensitivity estimation in each round by having nodes broadcast a single scalar. DPPS is then integrated into PartPSP, a privacy-preserving decentralized algorithm for non-convex optimization, which partitions model parameters into local and shared components and applies DPPS only to the shared parameters to reduce noise. Theoretical analysis and experimental results demonstrate that PartPSP achieves better optimization performance under the same privacy budget compared to existing methods.
Introduces a novel sensitivity estimation mechanism for protocol-level differential privacy in decentralized networks, enabling a lightweight and generalizable privacy-preserving communication protocol.
This paper investigates the impact of differential privacy (DP) mechanisms, namely gradient clipping and noise injection, on firing rate statistics within federated spiking neural networks (SNNs). The study demonstrates that DP significantly perturbs firing rates, leading to rate shifts, attenuated aggregation, and unstable client selection in a speech recognition task under non-IID data. The authors further link these rate shifts to sparsity and memory usage, providing insights into the trade-offs between privacy and performance in rate-based federated neuromorphic learning.
Quantifies the sensitivity of firing rate-based federated spiking neural networks to differential privacy mechanisms, revealing specific impacts on rate statistics, aggregation, and client selection.
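The two mechanisms studied, gradient clipping and noise injection, follow the generic clip-then-noise client update; a minimal sketch of that generic step is below (clip_norm and noise_multiplier are placeholder values rather than the paper's calibration, and SNN-specific surrogate-gradient details are omitted).

```python
import torch

def dp_client_update(grads, clip_norm=1.0, noise_multiplier=1.0):
    """Generic DP step: clip a client's update to clip_norm, then add Gaussian noise.
    Values here are illustrative, not the paper's settings."""
    total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    scale = torch.clamp(clip_norm / (total_norm + 1e-12), max=1.0)
    clipped = [g * scale for g in grads]
    # Noise scaled to the clipping bound, as in standard DP-SGD / DP-FedAvg.
    return [g + noise_multiplier * clip_norm * torch.randn_like(g) for g in clipped]
```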
The paper introduces RouterXBench, a comprehensive evaluation framework for LLM routers, addressing limitations of existing benchmarks by considering router ability, scenario alignment, and cross-domain robustness. The authors also propose ProbeDirichlet, a novel router that leverages internal hidden states and learnable Dirichlet distributions for probabilistic training, capturing model uncertainty more effectively than methods relying on output probabilities or external embeddings. Empirical results demonstrate that ProbeDirichlet outperforms existing routers, achieving significant improvements in router ability and high-accuracy scenarios, while exhibiting robust generalization across diverse model families, scales, tasks, and workflows.
Introduces ProbeDirichlet, a router that aggregates cross-layer hidden states via learnable Dirichlet distributions for improved uncertainty estimation and routing decisions.
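As a rough illustration of parameterizing routing uncertainty with a learnable Dirichlet distribution, a toy router head is sketched below; the cross-layer pooling, training loss, and architecture are assumptions, not ProbeDirichlet's actual design.

```python
import torch
import torch.nn as nn

class DirichletRouterHead(nn.Module):
    """Toy router head: pooled hidden states -> Dirichlet concentrations over candidate models."""
    def __init__(self, hidden_dim, num_models):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_models)

    def forward(self, pooled_hidden):  # (batch, hidden_dim)
        alpha = nn.functional.softplus(self.proj(pooled_hidden)) + 1.0  # concentrations > 1
        probs = alpha / alpha.sum(-1, keepdim=True)   # expected routing probabilities
        # Smaller total concentration -> flatter Dirichlet -> higher routing uncertainty.
        uncertainty = alpha.size(-1) / alpha.sum(-1)
        return probs, uncertainty
```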
The paper introduces MUSE, a multi-tenant model serving framework designed to address the challenge of threshold recalibration in Score-as-a-Service environments caused by model updates. MUSE decouples model scores from client decision boundaries using dynamic intent-based routing and a two-level score transformation to map model outputs to a stable reference distribution. Deployed at Feedzai, MUSE significantly reduces model lead time from weeks to minutes, processing over a thousand events per second across dozens of tenants, leading to substantial savings in fraud losses and operational costs.
Introduces a multi-tenant model serving framework, MUSE, that enables seamless model updates by decoupling model scores from client decision boundaries through dynamic intent-based routing and score transformation.
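The idea of decoupling scores from client thresholds can be approximated with a quantile (CDF) mapping onto a fixed reference distribution; the sketch below is a simplified stand-in for MUSE's two-level score transformation, with hypothetical names throughout.

```python
import numpy as np

def fit_quantile_map(new_model_scores, reference_scores, n_points=1001):
    """Monotone map sending the new model's score distribution onto the reference distribution."""
    qs = np.linspace(0.0, 1.0, n_points)
    src = np.quantile(new_model_scores, qs)   # quantiles of the new model's scores
    dst = np.quantile(reference_scores, qs)   # quantiles of the stable reference distribution
    return lambda s: np.interp(s, src, dst)   # piecewise-linear quantile mapping

# calibrated = fit_quantile_map(shadow_scores, reference_scores)
# stable_score = calibrated(raw_model_score)  # client thresholds keep their meaning across updates
```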
This paper investigates the latency overhead introduced by enabling optional security controls on disaggregated 5G Radio Access Network (RAN) interfaces and the user plane. The authors implemented a testbed with a disaggregated RAN and standardized security mechanisms to measure the impact of cryptographic operations on latency. Results indicate that while disaggregated RANs maintain a latency advantage over monolithic designs even with security enabled, achieving sub-1ms round-trip times is difficult due to the cryptographic overhead.
Quantifies the latency overhead of optional security mechanisms in a disaggregated 5G RAN, demonstrating the trade-offs between security and ultra-low latency.
This paper introduces an enhanced anonymity architecture based on the Loopix mix-network, tailored for the challenges of LEO satellite constellations and mixed-trust environments. The architecture incorporates a multi-path transport protocol using (n, k) erasure codes for reliability, a computationally efficient Private Information Retrieval (PIR) protocol for route discovery, and adaptive, centrality-based delay strategies to mitigate topological bias. Packet-level simulations validate the architecture, demonstrating near-zero message loss with the multi-path transport and quantifying the overhead of the PIR protocol, showing a practical anonymity-to-latency trade-off.
Introduces a novel anonymity architecture for LEO satellite constellations that integrates multi-path transport, PIR-based route discovery, and adaptive delay strategies to enhance reliability and privacy.
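As a toy illustration of (n, k) erasure-coded multi-path transport, the sketch below uses a systematic k = 2, n = 3 XOR-parity code that survives the loss of any one path; the paper's transport uses general (n, k) erasure codes, so treat this purely as a stand-in.

```python
def encode_3_of_2(payload: bytes):
    """Split payload into two data shares plus one XOR parity share, one per path."""
    half = (len(payload) + 1) // 2
    a, b = payload[:half], payload[half:].ljust(half, b"\x00")
    parity = bytes(x ^ y for x, y in zip(a, b))
    return {"a": a, "b": b, "p": parity, "len": len(payload)}

def decode_3_of_2(shares):
    """Recover the payload from any two of the three shares."""
    a, b, p = shares.get("a"), shares.get("b"), shares.get("p")
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, p))
    if b is None:
        b = bytes(x ^ y for x, y in zip(a, p))
    return (a + b)[: shares["len"]]
```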
This paper presents a production-grade architecture for a distributed rate limiting system using Redis and Lua scripting, focusing on the trade-offs between accuracy and memory cost. It compares the Rolling Window algorithm's performance against Token Bucket and Fixed Window algorithms, demonstrating its accuracy with manageable memory overhead. The system employs a three-layer architecture for managing and updating rate-limiting rules, deployed on a Redis Cluster for availability and scalability.
Quantifies the accuracy and memory cost trade-off of the Rolling Window rate limiting algorithm compared to Token Bucket and Fixed Window algorithms within a production system.
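A minimal sketch of a Rolling Window check backed by a Redis sorted set and an atomic Lua script (via redis-py); key naming, limits, and window size are illustrative, and the paper's three-layer rule-management architecture and cluster deployment are not reproduced here.

```python
import time
import redis

ROLLING_WINDOW_LUA = """
local key    = KEYS[1]
local now_ms = tonumber(ARGV[1])
local win_ms = tonumber(ARGV[2])
local limit  = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', key, 0, now_ms - win_ms)  -- drop requests outside the window
if redis.call('ZCARD', key) < limit then
  redis.call('ZADD', key, now_ms, now_ms .. '-' .. math.random())
  redis.call('PEXPIRE', key, win_ms)
  return 1  -- allowed
end
return 0    -- rejected
"""

r = redis.Redis()
check = r.register_script(ROLLING_WINDOW_LUA)

def allow(client_id, limit=100, window_ms=60_000):
    now_ms = int(time.time() * 1000)
    return bool(check(keys=[f"rl:{client_id}"], args=[now_ms, window_ms, limit]))
```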
This paper introduces RooflineBench, a benchmarking framework for on-device LLMs based on the Roofline model, using operational intensity (OI) to unify architectural primitives and hardware constraints. They define an inference-potential region and introduce Relative Inference Potential to compare LLM efficiency on the same hardware. Empirical analysis reveals that sequence length significantly influences performance and OI, identifies OI regression with model depth, and demonstrates how structural refinements like M-LA can unlock inference potential.
Introduces RooflineBench, a novel benchmarking framework leveraging Roofline analysis and operational intensity to evaluate and optimize on-device LLM performance across diverse hardware platforms.
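For reference, the standard Roofline quantities the framework builds on (textbook definitions, not RooflineBench-specific formulas) are

$$\mathrm{OI} = \frac{\text{FLOPs}}{\text{bytes moved to/from memory}}, \qquad P_{\text{attainable}} = \min\!\left(P_{\text{peak}},\; \mathrm{OI}\cdot BW_{\text{mem}}\right),$$

on top of which the paper defines its inference-potential region and Relative Inference Potential.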
The paper introduces MING, an MLIR-based framework for automating the HLS design process of CNNs targeting resource-constrained edge FPGAs. MING employs a streaming architecture with optimized buffer management to address the limitations of existing frameworks in handling stringent resource constraints. Experiments demonstrate that MING achieves significant speedups (15x for multi-layer CNN kernels and up to 200x for single-layer kernels) and can generate efficient designs for larger input sizes where other frameworks fail.
Introduces an MLIR-based framework, MING, that automates HLS design for CNNs on resource-constrained edge FPGAs using a streaming architecture with optimized buffer management.
This paper investigates the Contention Resolution problem, exploring the impact of a global clock on protocol latency. The authors present a new protocol with latency $O\left(\left(n\log\log n\log^{(3)} n\log^{(4)} n\cdots \log^{(\log^* n)} n\right)\cdot 2^{\log^* n}\right)$, demonstrating a significant complexity gap compared to local-clock protocols. Furthermore, they establish a separation between expected latency and high-probability latency for memoryless protocols and prove the impossibility of simultaneously optimizing both metrics.
Establishes a roughly log(n) complexity gap between randomized Contention Resolution protocols that have access to a global clock and those with only local clocks.
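For readers unfamiliar with the notation, $\log^{(i)} n$ is the $i$-times iterated logarithm and $\log^{*} n$ counts how many iterations bring $n$ down to at most 1:

$$\log^{(1)} n = \log n, \qquad \log^{(i)} n = \log\!\left(\log^{(i-1)} n\right), \qquad \log^{*} n = \min\{\, i \ge 1 : \log^{(i)} n \le 1 \,\}.$$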
The paper introduces PASCAL, a phase-aware scheduling algorithm designed to optimize the serving of reasoning-based LLMs by explicitly differentiating and prioritizing the reasoning phase to minimize Time-To-First-Token (TTFT). PASCAL employs a hierarchical scheduler with instance-level placement, intra-instance execution management, and dynamic migration at phase boundaries to balance load and reduce interference. Experiments using DeepSeek-R1-Distill-Qwen-32B show that PASCAL reduces tail TTFT by up to 72% while preserving answering phase SLO attainment, highlighting the benefits of phase-aware scheduling.
Introduces a phase-aware scheduling algorithm, PASCAL, that optimizes LLM serving by prioritizing the reasoning phase to reduce TTFT and employing controlled preemption and token pacing during the answering phase to maintain QoE.
The paper introduces OServe, a novel LLM serving system designed to address spatial and temporal heterogeneity in LLM workloads by enabling heterogeneous and flexible model deployments. OServe employs a workload-aware scheduling algorithm to optimize model deployment based on real-time workload characteristics and uses a workload-adaptive switching method to migrate model deployments in response to predicted workload changes. Experiments using real-world traces demonstrate that OServe achieves up to a 2x (average 1.5x) performance improvement compared to existing LLM serving systems.
Introduces a spatial-temporal workload orchestration framework, OServe, that dynamically adapts model deployment to heterogeneous and time-varying LLM workloads.
This paper addresses the challenge of unreliable read/write operations in Antiferromagnetic Tunnel Junction (AFMTJ) memories due to their ultrafast dynamics and low tunnel magnetoresistance (TMR). They propose a device-circuit co-design approach, specifically an asymmetric pulse driver (PD) for write operations and a self-timed sense amplifier (STSA) with dynamic trip-point tuning for read operations. Simulation results demonstrate improved read/write yield under process, voltage, and temperature (PVT) variations and 3D integration parasitics compared to standard MRAM front-ends, while preserving AFMTJ latency and energy benefits.
Introduces a device-circuit co-designed read/write interface, comprising an asymmetric pulse driver and a self-timed sense amplifier with dynamic trip-point tuning, to enhance the robustness of AFMTJ memories under realistic operating conditions.
This paper addresses performance degradation in federated learning (FL) due to data heterogeneity and variable participation frequencies among nodes. They introduce PMFL, a model-contrastive FL framework that incorporates historical training information to improve model consistency and reduce performance fluctuations. PMFL demonstrates superior performance compared to existing FL methods in heterogeneous scenarios through extensive experimentation.
Introduces a model-contrastive federated learning framework (PMFL) that leverages historical local and global models to improve performance in heterogeneous federated learning scenarios.
This paper introduces a decentralized multi-robot system for detecting and tracking floating containers in maritime environments, using a team of UAVs and an autonomous surface vessel. The system employs YOLOv8 and stereo disparity for visual detection on each UAV, followed by per-object Extended Kalman Filters (EKFs) for tracking with uncertainty-aware data association. Track summaries are exchanged and fused using covariance intersection to maintain consistency, and an information-driven assignment module optimizes target allocation and UAV viewpoints.
Introduces a decentralized multi-robot perception framework that combines visual detection, EKF tracking with uncertainty-aware data association, conservative track fusion via covariance intersection, and information-driven task assignment for robust maritime object tracking.
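The covariance intersection rule used for conservative fusion of two track estimates $(\hat{x}_a, P_a)$ and $(\hat{x}_b, P_b)$ with unknown cross-correlation is standard:

$$P_{ci}^{-1} = \omega P_a^{-1} + (1-\omega) P_b^{-1}, \qquad P_{ci}^{-1}\hat{x}_{ci} = \omega P_a^{-1}\hat{x}_a + (1-\omega) P_b^{-1}\hat{x}_b, \qquad \omega \in [0, 1],$$

with the weight $\omega$ typically chosen to minimize the trace or determinant of $P_{ci}$.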
This paper introduces a Collaborative Intrusion Detection System (CIDS) framework that dynamically optimizes the allocation of intrusion detectors across nodes in a layered network based on available resources and data types. The framework adapts to changing operational scenarios by reconfiguring detectors to maintain an optimal configuration without requiring heavy computation, making it suitable for edge device deployment. The evaluation, conducted using distributed datasets including a novel dataset based on a cyberattack targeting a ground drone, demonstrates the framework's ability to achieve adaptive and efficient intrusion detection.
Introduces a resource-aware CIDS framework that dynamically optimizes detector allocation in layered networks for efficient intrusion detection in resource-constrained environments.
The paper introduces SParse Expert Synchronization (SPES), a decentralized training framework for Mixture-of-Experts (MoE) LLMs that reduces memory footprint by training only a subset of experts per node and periodically synchronizing them. This approach addresses the GPU memory limitations of existing decentralized training methods, which still require training the entire model on each node. The authors demonstrate that SPES enables training of 2B, 7B, and 9B parameter MoE models on resource-constrained hardware, achieving performance comparable to centrally trained LLMs with similar computational budgets.
Introduces SParse Expert Synchronization (SPES), a memory-efficient decentralized training framework that enables pretraining large MoE language models on distributed GPUs with limited memory.
This paper analyzes the potential of 6G networks to enhance robotic systems by mapping IMT-2030 key performance indicators to robotic functional blocks like sensing, perception, and actuation. It argues that 6G's enhanced capabilities are crucial for enabling more complex and autonomous robotic systems. The paper proposes a high-level architectural framework integrating robotic, intelligent, and network service planes and demonstrates a real-time safety framework for human-robot collaboration as a use case.
Proposes a high-level architectural framework integrating robotic, intelligent, and network service planes to leverage 6G capabilities for advanced robotics.
This paper introduces a spectrum framework for polycentric digital ecosystems, conceptualizing them as nested socio-technical systems across personal, organizational, inter-organizational, and global layers. It addresses the increasing need for resilient digital collaboration amidst geopolitical and technological fragmentation. The framework highlights how AI and automation, blockchain trust, federated data spaces, and immersive technologies can orchestrate digital integration in these ecosystems.
Introduces a multi-layered framework for polycentric digital ecosystems to facilitate collaboration in fragmented environments.
The paper introduces SparrowRL, a novel RL training system designed to overcome bandwidth limitations in commodity-networked GPU resources by exploiting the sparsity of per-step updates during RL fine-tuning. SparrowRL achieves this by representing updates as sparse delta checkpoints, pipelining delta extraction with multi-stream transmission, overlapping transfer with rollout generation, and employing throughput- and bandwidth-aware scheduling. Experiments on Qwen3 models show SparrowRL reduces per-step transfer payload by 79x and improves throughput by 2.4-9.5x over full-weight broadcast across WAN, achieving comparable throughput to RDMA clusters with improved cost efficiency.
Introduces SparrowRL, a system that enables efficient RL training over commodity networks by leveraging sparse delta checkpoints and bandwidth-aware scheduling to minimize communication overhead.
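A minimal sketch of the sparse-delta idea is shown below; the magnitude threshold and per-tensor index/value layout are assumptions, and SparrowRL's checkpoint format, multi-stream pipelining, and scheduling are not reproduced.

```python
import torch

def extract_sparse_delta(new_state, old_state, threshold=1e-6):
    """Keep only parameters whose change exceeds a threshold; ship indices + values."""
    delta = {}
    for name, new_w in new_state.items():
        diff = (new_w - old_state[name]).flatten()
        idx = (diff.abs() > threshold).nonzero(as_tuple=True)[0]
        delta[name] = (idx, diff[idx])           # typically far smaller than the full tensor
    return delta

def apply_sparse_delta(state, delta):
    """Apply a received sparse delta to a stale replica of the weights."""
    for name, (idx, vals) in delta.items():
        flat = state[name].flatten()
        flat[idx] += vals
        state[name] = flat.view_as(state[name])
```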
This paper addresses the computational bottleneck introduced by post-quantum cryptography (PQC) in Open Radio Access Networks (O-RAN) control planes, which impacts energy efficiency. They propose an energy-aware framework with a Crypto Policy rApp and a Security Operations Scheduling (SOS) xApp to strategically manage PQC suites and optimize cryptographic enforcement timing and placement. Through discrete-event simulation, the proposed scheduling approach achieves a 60% reduction in per-handshake energy consumption without compromising slice latency targets.
Introduces an energy-aware scheduling framework for PQC handshakes in O-RAN that minimizes energy consumption while meeting slice latency requirements.
The paper introduces DMind-3, a three-layered Edge-Local-Cloud AI system for secure and low-latency Web3 financial transactions. It addresses the limitations of cloud-centric and purely local AI solutions by using a deterministic edge firewall, a private local reasoning engine, and a policy-governed cloud synthesizer. The system is trained with Hierarchical Predictive Synthesis (HPS) and Contrastive Chain-of-Correction Supervised Fine-Tuning (C$^3$-SFT) to improve performance and reliability.
Introduces a novel Edge-Local-Cloud AI architecture, DMind-3, that balances privacy, latency, and global context for secure Web3 transactions.
The paper introduces PPTAM$\eta$, a CI/CD pipeline integrated with GitLab CI, designed to measure the energy consumption of containerized API systems during rapid deployment cycles. It addresses the gap in current CI/CD practices by incorporating power and energy measurement, revealing the impact of code changes on energy efficiency. The evaluation on a JWT-authenticated API demonstrates the pipeline's ability to collect performance and energy metrics across different commits, enabling version comparison and trend analysis.
Introduces an automated CI/CD pipeline, PPTAM$\eta$, that integrates power and energy measurement into GitLab CI for containerized API systems, enabling energy-aware development.
This paper investigates the relationship between performance antipatterns and energy consumption in microservice architectures by implementing ten common antipatterns as isolated microservices and measuring their performance, CPU/DRAM power consumption, and resource utilization under controlled load. The study reveals that while all implemented antipatterns degrade performance, only a subset significantly increase power consumption, with some reaching CPU saturation and others exhibiting energy-performance coupling. The findings provide a basis for identifying performance antipatterns that also act as energy antipatterns, offering insights for energy-efficient microservice design.
Empirically demonstrates that not all performance antipatterns in microservices lead to increased power consumption, identifying specific cases where performance degradation does not correlate with higher energy usage due to CPU saturation effects.
The paper introduces DEL, a framework for differentially private and communication-efficient split inference of large language models (LLMs). DEL uses an embedding projection module and differentially private stochastic quantization to reduce communication overhead while preserving privacy. It then employs soft prompts on the server side to mitigate utility degradation caused by the privacy mechanisms, eliminating the need for local models.
Introduces a novel framework, DEL, that leverages soft prompts to improve the privacy-utility trade-off in LLM split inference, achieving differential privacy and communication efficiency.
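As a rough illustration of the stochastic-quantization building block, the sketch below shows plain unbiased randomized rounding onto a uniform grid; DEL's differential-privacy noise calibration and embedding projection are not reproduced here.

```python
import numpy as np

def stochastic_quantize(x, num_levels=16, lo=-1.0, hi=1.0, rng=None):
    """Randomized rounding onto a uniform grid; unbiased in expectation."""
    rng = rng or np.random.default_rng()
    step = (hi - lo) / (num_levels - 1)
    pos = (np.clip(x, lo, hi) - lo) / step
    base = np.floor(pos)
    q = base + (rng.random(np.shape(x)) < (pos - base))  # round up w.p. the fractional part
    return lo + q * step
```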
The paper introduces PrefillShare, an algorithm for sharing the prefill stage across multiple language models in disaggregated serving environments to reduce redundant computation and KV cache storage. PrefillShare factorizes models into prefill and decode modules, freezes the prefill module, and fine-tunes only the decode module, enabling multiple models to share a prefill module and its KV cache. Experiments demonstrate that PrefillShare achieves comparable accuracy to full fine-tuning while significantly improving latency (4.5x lower p95) and throughput (3.9x higher) in multi-model agent workloads.
Introduces PrefillShare, a novel algorithm that enables efficient sharing of the prefill stage and KV cache across multiple language models in a disaggregated serving system.
This paper introduces Processing Across Memory (PAM), a KV-centric LLM serving system designed to address the memory bandwidth and capacity bottlenecks in LLM serving. PAM employs a hierarchical memory architecture with heterogeneous PIM-enabled devices, distributing KV tokens based on context locality and introducing the PAMattention algorithm for parallel attention computation. The system further incorporates dynamic KV scheduling and migration to balance computational workloads across devices, leading to enhanced efficiency and scalability.
Introduces a hierarchical memory architecture and associated algorithms for LLM serving that coordinates heterogeneous PIM-enabled memory devices to balance high memory bandwidth with scalable capacity.
The paper addresses the challenge of domain adaptation in multi-agent collaborative perception for V2X systems, where directly applying parameter-efficient fine-tuning (PEFT) leads to performance degradation. They identify inter-frame redundancy and semantic erosion as key issues and propose FlowAdapt, a PEFT framework based on optimal transport. FlowAdapt uses Wasserstein Greedy Sampling to filter redundant samples and Progressive Knowledge Transfer to inject early-stage representations into later stages, achieving state-of-the-art performance with only 1% trainable parameters.
Introduces FlowAdapt, a parameter-efficient domain adaptation framework for collaborative perception that leverages optimal transport to minimize information transport costs across data distributions and network hierarchies.
The paper introduces GORGO, a method for cross-region LLM load balancing that minimizes Time-to-First-Token (TTFT) by jointly optimizing for compute availability, network latency, and KV-cache reuse. GORGO models a total serving cost function and uses it to make routing decisions, addressing the limitations of existing approaches that either ignore network latency or suffer from synchronization overhead. Experiments on custom infrastructure demonstrate that GORGO reduces P99 TTFT through network-aware routing and achieves a 2.5x speedup in median TTFT compared to prior methods by using a centralized HTTP proxy.
Introduces a network-aware routing policy, GORGO, that minimizes TTFT in cross-region LLM inference by optimizing a cost function that considers compute, network latency, and KV-cache reuse.
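A sketch of the kind of per-region cost a network-aware router might minimize is below; every field name and term is a hypothetical stand-in rather than GORGO's actual cost function.

```python
def estimate_ttft_ms(region, prompt_tokens):
    """Hypothetical TTFT estimate combining network, queueing, and cache-adjusted prefill."""
    prefill_ms = region["prefill_ms_per_token"] * prompt_tokens
    cached_fraction = region["kv_cache_hit_fraction"]   # reusable prefix fraction
    return (region["network_rtt_ms"]
            + region["queue_delay_ms"]
            + prefill_ms * (1.0 - cached_fraction))

def route(regions, prompt_tokens):
    """Send the request to the region with the lowest estimated TTFT."""
    return min(regions, key=lambda r: estimate_ttft_ms(r, prompt_tokens))
```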
The paper introduces LAER-MoE, a framework for efficient Mixture-of-Experts (MoE) training that addresses load imbalance among experts during expert parallel training. LAER-MoE employs Fully Sharded Expert Parallel (FSEP), partitioning expert parameters across devices and restoring partial experts via All-to-All communication, enabling dynamic re-layout of experts to improve load balancing. Experiments on an A100 cluster demonstrate up to 1.69x speedup compared to existing state-of-the-art MoE training systems.
Introduces a novel parallel paradigm, Fully Sharded Expert Parallel (FSEP), that dynamically re-layouts expert parameters during training to mitigate load imbalance in Mixture-of-Experts models.
This paper introduces D-NSS, a decentralized stochastic optimization algorithm designed to handle heterogeneous variances in stochastic gradient estimators across nodes in distributed networks. The algorithm achieves a sample complexity that depends on the arithmetic mean of local standard deviations, improving upon existing methods that rely on worst-case or quadratic mean dependencies. The authors also prove a matching sample complexity lower bound, demonstrating the optimality of the arithmetic-mean dependence, and propose a variance-reduced version, D-NSS-VR, with improved sample complexity under mean-squared smoothness.
Establishes a decentralized stochastic optimization algorithm, D-NSS, with provable sample complexity optimality under heterogeneous variance by demonstrating a matching lower bound dependent on the arithmetic mean of local standard deviations.
The paper introduces FedGRPO, a federated learning framework for optimizing foundation models by leveraging data from domain clients while preserving privacy. It frames the problem as a reinforcement learning task where a server model learns from scalar reward signals provided by expert clients selected using a competence-based confidence graph. FedGRPO aggregates these rewards using a federated group-relative loss function, achieving improved downstream accuracy and communication efficiency compared to existing federated foundation model approaches.
Introduces FedGRPO, a privacy-preserving federated learning framework that optimizes foundation models by aggregating group-relative reward signals from expert clients selected via a competence-based confidence graph.
This paper introduces DeepFusionKernel, a deeply fused kernel designed to optimize the memory bandwidth bottleneck caused by large SwiGLU MLP blocks in agentic LLM inference with long contexts. By reducing HBM traffic and improving cache reuse, DeepFusionKernel significantly accelerates inference. Experiments demonstrate speedups of up to 13.2% on H100 and 9.7% on A100 GPUs compared to SGLang.
Introduces a deeply fused kernel, DeepFusionKernel, that optimizes memory bandwidth usage for SwiGLU MLP blocks in transformer models, leading to faster inference.
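For context, the unfused SwiGLU MLP whose memory traffic the kernel targets follows the standard reference below; the paper's contribution is executing this sequence as a single fused kernel rather than separate launches that round-trip activations through HBM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """Unfused reference: three GEMMs plus an elementwise SiLU and product,
    each moving activations through HBM when run as separate kernels."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```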
The paper introduces AUC-RAC, an auction-based mechanism for optimizing task allocation and resource utilization in IoT environments using Docker Swarm. It addresses the challenge of efficiently offloading computation-intensive tasks to multiple local servers by employing an auction-based bidding process among worker nodes managed by a manager node. Experimental results demonstrate improved offloading and computation performance through optimized resource allocation.
Introduces an auction-based mechanism, AUC-RAC, to optimize task allocation among local servers in a Docker Swarm environment for IoT devices, considering resource sufficiency.
This paper introduces Cachemir, a novel framework for fully homomorphic encryption (FHE) inference of generative LLMs that addresses the inefficiency of integrating KV caches in existing FHE solutions. Cachemir achieves this by developing HE packing algorithms tailored for KV cache utilization, an interleaved replicated packing algorithm for efficient vector-matrix multiplications, and an augmented bootstrapping placement strategy to minimize bootstrapping costs. Experiments show that Cachemir significantly outperforms state-of-the-art FHE inference frameworks like MOAI and THOR, achieving up to 67x speedup and generating tokens for Llama-3-8B in under 100 seconds on GPU.
Introduces a novel fully homomorphic encryption (FHE) inference framework, Cachemir, that significantly accelerates generative LLM inference by efficiently integrating and optimizing the KV cache.
This paper presents a hardware implementation of semi-empirical electronic structure methods, specifically Extended Hückel Theory (EHT) and non-self-consistent Density Functional Tight Binding (DFTB0), on a field-programmable gate array (FPGA). By implementing Hamiltonian construction and diagonalization directly on the FPGA using a streaming dataflow architecture, the design achieves deterministic execution and eliminates host intervention. The FPGA-based DFTB0 Hamiltonian generator demonstrates a greater than fourfold throughput improvement compared to a server-class CPU on a mid-range Artix-7 FPGA, highlighting the potential for significant acceleration.
Demonstrates a hardware-native implementation of semi-empirical electronic structure theory on an FPGA, achieving superior throughput compared to a CPU.
The paper introduces ECHO-2, a distributed reinforcement learning framework designed to optimize the post-training of large language models by distributing rollout execution across remote inference workers. ECHO-2 addresses challenges related to wide-area coordination and policy dissemination latency by treating policy staleness as a user-controlled parameter and overlapping rollout generation, dissemination, and training. Experimental results on GRPO post-training of 4B and 8B models demonstrate that ECHO-2 achieves significant cost efficiency improvements while maintaining comparable RL reward performance.
Introduces ECHO-2, a distributed RL framework that optimizes cost efficiency in LLM post-training by overlapping rollout generation, dissemination, and training, and managing policy staleness.
The paper introduces NetWorld, a Communication-based Diffusion World Model, to improve few-shot generalization across heterogeneous MARL tasks in wireless networks. NetWorld pre-trains a classifier-guided conditional diffusion model on multi-task offline datasets and performs trajectory planning within the learned world model, avoiding online interaction. The model incorporates a mean-field communication mechanism to address non-stationarity and promote coordination.
Introduces a communication-based diffusion world model (NetWorld) that enables few-shot generalization across heterogeneous MARL tasks in wireless networks by learning from offline data and planning within the learned environment.
This paper introduces a multifaceted approach to accelerate reservoir simulation by combining advanced software techniques, AI/ML-based enhancements, and scalable hardware solutions. The software innovations include multiscale SFI methods for black-oil models and AI/ML for phase labeling and saturation pressure prediction in compositional models, while hardware strategies involve CPU, GPU, and cloud-based execution. Applied to real-world, high-resolution reservoirs, the combined approach achieved up to 4x runtime reduction using multiscale SFI, up to 2x speedups with full-GPU execution, and up to 4x improvements with AI/ML, significantly compressing field development planning timelines.
Demonstrates a holistic approach to reservoir simulation acceleration by integrating multiscale physics, AI/ML-driven enhancements, and scalable compute infrastructure, resulting in significant runtime reductions and faster field development planning.
The authors present a refactored and optimized Python framework, built upon the LAMOST Atmospheric Parameter Pipeline (LASP), for scalable stellar parameter inference from large spectroscopic datasets. The framework includes a CPU-optimized module (LASP-CurveFit) and a GPU-accelerated module (LASP-Adam-GPU) that uses grouped optimization to process multiple spectra simultaneously. Applied to 10 million LAMOST spectra, the framework achieves significant speedups (reducing runtime to 7 hours on an NVIDIA A100 GPU) while maintaining accuracy and demonstrating improved transferability to the DESI DR1 dataset compared to the DESI pipeline, particularly for effective temperature and surface gravity of cool giants.
Introduces a modular, parallelized, and GPU-accelerated Python framework for stellar parameter inference that achieves significant speedups and improved accuracy compared to existing pipelines, particularly for cool giants.
This paper analyzes the challenges of implementing 3D parallelism in heterogeneous GPU environments, focusing on symmetric tensor parallelism and efficient gradient synchronization in asymmetric pipeline parallelism. The authors introduce AutoHet, a system that automatically optimizes the parallelism plan for distributed training on heterogeneous GPUs by framing device grouping and load balancing as an optimization problem. Experiments on large-scale models with diverse GPU combinations demonstrate that AutoHet achieves up to 1.79x training throughput speedup compared to Megatron-LM and Whale, and a 4.38x speedup in recovery speed compared to a spot instance baseline.
Introduces AutoHet, a novel system that automatically identifies and optimizes the parallelism strategy for distributed training across heterogeneous GPUs, considering device grouping, load balancing, and efficient recovery from spot instance preemption.
The paper introduces SIGMA, an open-source training stack designed to enhance the reliability, stability, and efficiency of large-scale distributed training on early-life AI hardware by addressing system disruptions, numerical errors, and parallelism optimization complexities. SIGMA incorporates the LUCIA TRAINING PLATFORM (LTP), which achieved 94.45% effective cluster accelerator utilization, and the LUCIA TRAINING FRAMEWORK (LTF), which successfully trained a 200B MoE model (SIGMA-MOE) with 2,048 AI accelerators, reaching 21.08% MFU and state-of-the-art downstream accuracy. This work demonstrates a robust and cost-effective alternative to existing accelerator stacks for large-scale AI training.
Introduces SIGMA, a comprehensive training stack that significantly improves the reliability, stability, and efficiency of large-scale AI training on early-life hardware.
This paper introduces a low-cost, open-source neuromorphic processor implemented on a Xilinx Zynq-7000 FPGA, designed to facilitate experimentation with spiking neural networks. The processor features all-to-all configurable connectivity and utilizes the leaky integrate-and-fire (LIF) neuron model with tunable parameters, enabling runtime reconfiguration via a UART interface. Validation on Iris and MNIST datasets demonstrates the design's energy efficiency and scalability, positioning it as a practical research platform.
Presents a flexible and accessible neuromorphic processor on an FPGA with all-to-all connectivity and runtime reconfigurability, intended for open-source release.
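The leaky integrate-and-fire dynamics are standard; a discrete-time software reference is sketched below (parameter values are illustrative and ignore the processor's fixed-point configuration).

```python
import numpy as np

def lif_step(v, input_current, leak=0.9, threshold=1.0, v_reset=0.0):
    """One discrete-time LIF update for a vector of neurons."""
    v = leak * v + input_current        # leaky integration of incoming current
    spikes = v >= threshold             # fire where the membrane potential crosses threshold
    v = np.where(spikes, v_reset, v)    # reset spiking neurons
    return v, spikes.astype(np.float32)
```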
The paper explores the use of LLM-based multi-agent systems for optimizing PyTorch inference on GPUs, aiming to surpass traditional compilers and manual kernel development. They introduce a logical framework to compare different multi-agent optimization strategies, focusing on the dynamics between exploration, exploitation, and error-fixing. Their best system, combining exploit-heavy strategies with error-fixing agents and fine-grained optimization steps, achieves a 2.88x speedup on an H100 GPU across KernelBench.
Systematically analyzes the dynamics of LLM-based multi-agent systems for PyTorch inference optimization, revealing the importance of balancing exploitation with error correction and the impact of optimization granularity.
This paper introduces a hierarchical profiling methodology for DNN performance analysis, addressing the increasing complexity and computational demands of modern AI systems. The methodology integrates tools like cProfile, PyTorch Profiler, and NVIDIA Nsight Systems to analyze performance across different layers of abstraction, from code to GPU execution. A case study using VGGNet demonstrates the approach's ability to trace performance bottlenecks from high-level code to low-level GPU operations.
Introduces a hierarchical profiling methodology that integrates multiple profiling tools to systematically analyze DNN performance across different levels of abstraction.
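A minimal sketch combining two of the named layers, cProfile for Python-level hotspots and PyTorch Profiler for operator/kernel timing; the model is a placeholder, a CUDA device is assumed, and Nsight Systems would wrap the whole run externally (e.g. `nsys profile python train.py`).

```python
import cProfile
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder workload; assumes a CUDA-capable GPU is available.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()).cuda()
x = torch.randn(64, 512, device="cuda")

with cProfile.Profile() as py_prof:                                   # Python-call level
    with profile(activities=[ProfilerActivity.CPU,
                             ProfilerActivity.CUDA]) as gpu_prof:     # operator/kernel level
        model(x)
        torch.cuda.synchronize()

py_prof.print_stats("cumulative")
print(gpu_prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```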
The paper introduces Instella, a family of fully open 3B parameter language models trained on publicly available data, addressing the lack of transparency in high-performing LLMs. Instella achieves state-of-the-art performance among fully open models of comparable size, despite using fewer pre-training tokens. The authors also release Instella-Long (128K context) and Instella-Math (reasoning-focused) variants, demonstrating the versatility of the base model.
Introduces Instella, a family of fully open 3B parameter language models, achieving state-of-the-art performance among fully open models and demonstrating competitive results with leading open-weight models of comparable size.
This paper introduces MTTR-A, a novel runtime reliability metric for multi-agent systems that quantifies cognitive recovery latency, reflecting that cognitive failures, rather than infrastructure faults, are increasingly the limiting factor for reliability. MTTR-A adapts classical dependability theory to agentic orchestration, measuring the time to detect reasoning drift and restore coherent operation, and is complemented by MTBF and a normalized recovery ratio (NRR). Empirical evaluation using a LangGraph-based benchmark with simulated drift and reflex recovery demonstrates measurable recovery behavior across multiple reflex strategies, establishing a quantitative foundation for runtime cognitive dependability.
Introduces MTTR-A, a new metric for quantifying cognitive recovery latency in multi-agent systems, adapting classical dependability theory to agentic orchestration.
The paper introduces OpenMENA, an open-source memristor interfacing system designed for energy-efficient edge AI applications, featuring a reproducible hardware interface, a firmware-software stack with high-level APIs, and a Voltage-Incremental Proportional-Integral (VIPI) programming method. OpenMENA enables weight transfer and on-device adaptation by mitigating device non-idealities through chip-in-the-loop fine-tuning. The system's efficacy is demonstrated through digit recognition and a real-world robot obstacle-avoidance task, showcasing its ability to map localization inputs to motor commands.
Introduces OpenMENA, the first fully open-source memristor interfacing system with integrated hardware, firmware, and software components for edge AI applications.
This paper addresses the challenge of efficiently serving multi-tenant deep learning inference requests on a single GPU in resource-constrained environments. They propose DRS, a Deep Reinforcement Scheduler, which jointly optimizes GPU resource allocation and request batching to maximize throughput and minimize job completion time. DRS leverages Deep Deterministic Policy Gradient (DDPG) for scheduling and NVIDIA Multi-Process Service (MPS) for spatial parallelism.
Introduces a deep reinforcement learning-based scheduler, DRS, that jointly optimizes resource allocation and request batching for multi-tenant GPU inference serving.

