Training Efficiency & Optimization Infrastructure
Efficient training methods, optimizer design, learning rate schedules, mixed precision, and gradient techniques.
Recent Papers
The paper introduces Modular Residual Reinforcement Learning (MoReL), a novel RL framework for dexterous hand retargeting that decomposes policy learning into finger-specific subpolicies and a residual coordination module. This decomposition enables efficient training from minimal demonstrations, low-latency inference, and flexible input modalities, addressing limitations of optimization-based and learning-based methods. Experiments demonstrate MoReL's superior performance and cross-platform adaptability in fine-grained dexterous manipulation tasks, validating the effectiveness of the architecture and reward design.
Introduces a modular reinforcement learning framework that decomposes dexterous hand retargeting into finger-specific subpolicies and a residual coordination module to improve generalization and reduce training data requirements.
This paper introduces SMAPPO, a scalable multi-agent reinforcement learning framework for decentralized multi-robot management in multi-machine tending scenarios. SMAPPO employs a novel observation encoder to achieve input-size invariance, enabling it to handle varying numbers of agents, machines, and storage areas without retraining. Experiments demonstrate that SMAPPO outperforms MAPPO in full retraining, curriculum learning, zero-shot generalization, and adaptability under low initial training, showing significant improvements in productivity, collision avoidance, and parts delivery.
Introduces a novel observation encoder for MAPPO that enables zero-shot generalization to variable numbers of agents and machines in multi-agent reinforcement learning.
This paper introduces Hadamard Linear Attention (HLA), a novel linear attention mechanism designed to more accurately approximate softmax attention. HLA applies a nonlinearity after the computation of pairwise similarities, unlike existing linear attention methods that apply nonlinear kernel functions independently to queries and keys. The authors demonstrate that this approach results in a higher-degree rational function approximation of softmax and show its effectiveness in a large diffusion transformer model for video generation.
Introduces Hadamard Linear Attention (HLA), a linear attention variant that applies a nonlinearity after pairwise similarity computation to better approximate softmax.
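The distinction above can be made concrete with a toy example (our illustration, not the paper's implementation): if the nonlinearity applied after the pairwise similarity is a polynomial, such as an elementwise square, the scores still admit a linear-attention-style factorization, because a squared dot product equals an inner product of outer-product features.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 4))   # queries
K = rng.standard_normal((8, 4))   # keys

# Nonlinearity applied AFTER the pairwise similarity: s(q, k) = (q . k)^2.
S_direct = (Q @ K.T) ** 2                      # O(n^2 d) if done naively

# The same scores via a feature map: (q . k)^2 = <q (x) q, k (x) k>,
# so the computation factorizes like standard linear attention: O(n d^2).
phi = lambda X: np.einsum('ni,nj->nij', X, X).reshape(len(X), -1)
S_factored = phi(Q) @ phi(K).T

assert np.allclose(S_direct, S_factored)
```

Higher-degree rational approximations of softmax, as the paper pursues, generalize this idea beyond a single square.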
The paper introduces Seq2Seq2Seq, a novel lossless compression method using a T5 language model architecture trained with reinforcement learning to compress data into discrete token sequences. This approach preserves the token-based structure of the original data, unlike autoencoders that use continuous latent spaces, leading to improved compression ratios. The model is trained using an off-policy reinforcement learning algorithm to optimize sequence length for minimal redundancy.
Introduces Seq2Seq2Seq, a lossless compression method that leverages reinforcement learning to train a T5 language model to compress data into discrete token sequences, preserving the original token structure.
The paper introduces Differentially Private Perturbed Push-Sum (DPPS), a protocol-level differential privacy mechanism for decentralized communication networks that addresses the challenge of sensitivity estimation in each round by having nodes broadcast a single scalar. DPPS is then integrated into PartPSP, a privacy-preserving decentralized algorithm for non-convex optimization, which partitions model parameters into local and shared components and applies DPPS only to the shared parameters to reduce noise. Theoretical analysis and experimental results demonstrate that PartPSP achieves better optimization performance under the same privacy budget compared to existing methods.
Introduces a novel sensitivity estimation mechanism for protocol-level differential privacy in decentralized networks, enabling a lightweight and generalizable privacy-preserving communication protocol.
This paper investigates the impact of differential privacy (DP) mechanisms, namely gradient clipping and noise injection, on firing rate statistics within federated spiking neural networks (SNNs). The study demonstrates that DP significantly perturbs firing rates, leading to rate shifts, attenuated aggregation, and unstable client selection in a speech recognition task under non-IID data. The authors further link these rate shifts to sparsity and memory usage, providing insights into the trade-offs between privacy and performance in rate-based federated neuromorphic learning.
Quantifies the sensitivity of firing rate-based federated spiking neural networks to differential privacy mechanisms, revealing specific impacts on rate statistics, aggregation, and client selection.
The paper introduces a pedagogically-inspired knowledge distillation framework (IOA) for transferring knowledge from large language models (LLMs) to smaller student models. The framework incorporates Bloom's Mastery Learning Principles and Vygotsky's Zone of Proximal Development to dynamically identify knowledge deficiencies, organize knowledge delivery through progressive curricula, and adapt representations. Experiments using LLaMA and Qwen models demonstrate that IOA significantly outperforms baseline distillation methods, achieving higher performance on DollyEval, MATH, and HumanEval benchmarks while using significantly fewer parameters.
Introduces a novel three-stage knowledge distillation framework (IOA) that incorporates pedagogical principles to systematically improve student model performance by identifying knowledge gaps, organizing knowledge delivery, and adapting representations.
The paper introduces Agent-guided Policy Search (AGPS), a novel reinforcement learning framework that replaces human supervisors with a multimodal agent to improve sample efficiency in robotic manipulation tasks. AGPS leverages the agent as a semantic world model, using executable tools to provide corrective waypoints and spatial constraints for exploration. Experiments on precision insertion and deformable object manipulation tasks demonstrate that AGPS outperforms human-in-the-loop methods, achieving better sample efficiency by automating the supervision pipeline.
Introduces Agent-guided Policy Search (AGPS), a framework that automates robot reinforcement learning by using a multimodal agent to provide corrective guidance, thereby improving sample efficiency and scalability compared to human-in-the-loop methods.
The paper addresses the computational inefficiency of evolutionary AI agents that repeatedly invoke LLMs by proposing AdaptEvolve, a framework for adaptive LLM selection during evolutionary refinement. AdaptEvolve uses intrinsic generation confidence to estimate real-time solvability and dynamically selects an LLM appropriate for the current generation step. Experiments demonstrate that confidence-driven selection achieves a better Pareto frontier, reducing inference costs by 37.9% while maintaining 97.5% of the accuracy of static large models.
Introduces AdaptEvolve, a novel adaptive LLM selection framework for evolutionary AI agents that leverages intrinsic generation confidence to dynamically choose the most efficient LLM for each generation step.
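The routing rule can be sketched as follows. This is a minimal guess at what confidence-driven selection looks like, assuming the confidence signal is the exponentiated mean token log-probability and that a single threshold decides between two model tiers; the paper's exact estimator and policy may differ.

```python
import math

def generation_confidence(token_logprobs):
    """Mean token log-probability, exponentiated, as an intrinsic
    confidence signal (one simple choice, not the paper's exact one)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def select_model(token_logprobs, threshold=0.8):
    """Route to a cheap model while the step looks solvable; escalate
    to the large model when confidence drops below the threshold."""
    conf = generation_confidence(token_logprobs)
    return "small-llm" if conf >= threshold else "large-llm"

easy_step = [-0.05, -0.02, -0.10, -0.01]    # confident generation
hard_step = [-1.2, -0.9, -2.1, -0.7]        # uncertain generation
assert select_model(easy_step) == "small-llm"
assert select_model(hard_step) == "large-llm"
```

The model names here are placeholders; the point is only that the decision is made per generation step from signals the model already produces.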
This paper addresses temporal domain generalization (TDG) for LLMs by reformulating it geometrically under parameter-efficient fine-tuning. It posits that the low-dimensional temporal structure of model evolution can be preserved under parameter-efficient reparameterization. The authors introduce Manifold-aware Temporal LoRA (MaT-LoRA), which constrains temporal updates to a shared low-dimensional manifold within a low-rank adaptation subspace, modeling its evolution through a structured temporal core, and achieving superior temporal generalization performance with practical scalability.
Introduces MaT-LoRA, a parameter-efficient fine-tuning method that constrains temporal updates to a low-dimensional manifold within a LoRA subspace and models its evolution with a structured temporal core for improved temporal domain generalization in LLMs.
This paper introduces a continuous learning architecture for edge-based malware detection that leverages LoRA adapters to enable local adaptation and global knowledge sharing in resource-constrained environments. The approach fine-tunes lightweight transformer models (DistilBERT, DistilGPT-2, TinyT5) locally on edge devices and aggregates/redistributes only the LoRA modules, avoiding the exchange of raw data. Experiments on Edge-IIoTset and TON-IoT datasets demonstrate that this LoRA-based exchange improves accuracy by 20-25% when encountering unseen attacks, while maintaining stable performance and adding minimal overhead to model size.
Proposes a parameter-efficient continuous learning framework for edge-based malware detection that uses LoRA to facilitate knowledge sharing between edge devices without transmitting raw data.
The paper identifies a "premature satisfaction" issue in Direct Preference Optimization (DPO) where the reference policy's preference for rejected responses attenuates the gradient even when the policy is still incorrect. To address this, they propose Hybrid-DPO (HyPO), a modification that conditionally applies the reference signal, treating it as neutral when pessimistic. HyPO improves inference-aligned metrics and pairwise win rates by strengthening per-example learning signals on pessimistic pairs while maintaining DPO's objective form and computational cost.
Introduces Hybrid-DPO (HyPO), a drop-in replacement for DPO that conditionally debiases the reference signal to mitigate premature satisfaction in pessimistic pairs.
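The conditional reference signal can be illustrated numerically. The gating rule below is our reading of the abstract: when the reference margin is pessimistic (the reference prefers the rejected response), it is treated as neutral, so the sigmoid is not pushed toward saturation and the per-example gradient stays strong.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(policy_margin, ref_margin, beta=0.1):
    # Standard DPO: the reference margin is always subtracted.
    return -math.log(sigmoid(beta * (policy_margin - ref_margin)))

def hypo_loss(policy_margin, ref_margin, beta=0.1):
    # Sketch of the conditional reference: when the reference is
    # "pessimistic" (it prefers the rejected response, ref_margin < 0),
    # treat it as neutral so the gradient is not attenuated prematurely.
    effective_ref = ref_margin if ref_margin >= 0 else 0.0
    return -math.log(sigmoid(beta * (policy_margin - effective_ref)))

# A pessimistic pair: the policy barely prefers the chosen response,
# while the reference strongly prefers the rejected one. DPO's implicit
# margin is inflated by the reference; the gated variant's is not.
pm, rm = 0.5, -3.0
assert hypo_loss(pm, rm) > dpo_loss(pm, rm)   # stronger learning signal
```

On pairs where the reference is not pessimistic (`ref_margin >= 0`) the two losses coincide, matching the claim that the objective form and cost are preserved.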
The paper introduces Temperature Adaptive Meta Policy Optimization (TAMPO), a novel framework that learns to control the temperature hyperparameter of an LLM during reinforcement learning. TAMPO uses a hierarchical two-loop process where an inner loop updates the LLM policy using trajectories sampled at temperatures selected by a meta-policy, and an outer loop updates the meta-policy to favor temperatures that maximize the likelihood of high-advantage trajectories. Experiments on mathematical reasoning benchmarks demonstrate that TAMPO outperforms baselines with fixed or heuristic temperature schedules, showing the effectiveness of learned temperature control for adaptive exploration.
Introduces a hierarchical reinforcement learning framework, TAMPO, that learns a meta-policy to dynamically adjust the temperature parameter of an LLM, optimizing exploration during policy learning.
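The outer loop can be sketched as a score-function update on a softmax meta-policy over a discrete temperature grid. Everything below is a toy stand-in: the `fake_advantage` function replaces an actual inner-loop RL run, and the grid, learning rate, and update rule are illustrative assumptions, not TAMPO's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
temps = np.array([0.3, 0.7, 1.0, 1.3])
logits = np.zeros(len(temps))          # meta-policy over temperatures

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy stand-in for "advantage of trajectories sampled at temperature t":
# a fixed peak at 0.7 plus noise (a real inner loop would run RL updates).
def fake_advantage(t):
    return -(t - 0.7) ** 2 + 0.02 * rng.standard_normal()

lr = 0.5
for _ in range(500):
    p = softmax(logits)
    i = rng.choice(len(temps), p=p)    # meta-policy samples a temperature
    adv = fake_advantage(temps[i])
    grad = -p.copy(); grad[i] += 1.0   # REINFORCE-style score function
    logits += lr * adv * grad          # favor high-advantage temperatures

assert temps[softmax(logits).argmax()] == 0.7
```

The meta-policy concentrates on the temperature whose sampled trajectories carry the highest advantage, which is the behavior the hierarchical two-loop design is meant to produce.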
The paper introduces a novel parameter-efficient fine-tuning (PEFT) method that adapts large pretrained models by learning per-neuron thresholds and gains in activation space, inspired by neuromodulation. This approach aims to change the mode of computation by selecting and rescaling existing computations rather than rewriting weights, offering improved interpretability. Experiments on MNIST and rotated MNIST demonstrate that the method can improve accuracy over a frozen baseline with significantly fewer trainable parameters than LoRA, while also enabling neuron-level attribution and conditional computation.
Introduces a parameter-efficient fine-tuning method that learns per-neuron thresholds and gains in activation space to adapt pretrained models by changing the mode of computation.
This paper addresses the instability issues in Rectified Flow (RF) inversion, which arise from accumulated approximation errors during the inversion process. The authors introduce Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it towards a running average of past velocities within a theoretically motivated spherical Gaussian constraint. They further propose mimic-CFG, a velocity correction scheme for editing tasks that interpolates between the current velocity and its projection onto the historical average.
Introduces Proximal-Mean Inversion (PMI) and mimic-CFG, two novel, training-free methods to stabilize Rectified Flow inversion and improve image reconstruction and editing fidelity.
This paper introduces a dissipative ground state preparation protocol tailored for simulating chemical reactions, specifically targeting strongly correlated transition states that are difficult for traditional methods. The protocol propagates a state along a discretized reaction coordinate using Procrustes-aligned orbital rotations, stabilized by engineered dissipative cooling. The authors demonstrate that for reaction paths satisfying a localized Eigenstate Thermalization Hypothesis (ETH) drift condition, the algorithm achieves ground state preparation with a gate complexity of $\widetilde{O}(N_o^{3}/\epsilon_E)$, and provide resource estimates for relevant chemical systems.
Introduces a dissipative ground state preparation protocol leveraging Procrustes-aligned orbital rotations and engineered dissipation to efficiently prepare ground states at chemical transition states.
This paper introduces Trajectory Self-Distillation (T3D), a novel framework for improving the generation quality of few-step Diffusion Language Models (DLLMs) by distilling the model's own generative trajectories. T3D incorporates Direct Discriminative Optimization (DDO), a reverse-KL objective, to encourage mode-seeking behavior during distillation, focusing the student model on high-probability regions of the teacher model's output space. Experiments across various benchmarks demonstrate that T3D significantly outperforms existing few-step DLLM baselines, substantially reducing the performance gap with full-step decoding.
Introduces a trajectory self-distillation framework, T3D, that leverages direct discriminative optimization to improve the generation quality of few-step diffusion language models.
This paper introduces Distribution Discriminant Theory (DDT) to quantify the alignment between training data and the model-induced distribution in supervised fine-tuning (SFT) of LLMs. Based on DDT, they propose In-Distribution Finetuning (IDFT), a loss-level method, and Hinted Decoding, a data-level technique, to improve generalization by aligning the training data distribution with the model's. Experiments show that the proposed framework achieves generalization performance comparable to offline RL methods like DPO and SimPO, while retaining the efficiency of SFT.
Introduces Distribution Discriminant Theory (DDT) to quantify and improve the alignment between training data and model-induced distributions in LLM supervised fine-tuning.
The paper introduces PLESS, a pseudo-label enhancement strategy for weakly supervised segmentation using scribble annotations, addressing the limitations of noisy and incomplete supervision. PLESS leverages a hierarchical partitioning of the image into spatially coherent regions to propagate scribble information and refine pseudo-labels within these regions. Experiments on cardiac MRI datasets demonstrate that PLESS consistently improves segmentation accuracy across different scribble-supervised algorithms.
Introduces a novel pseudo-label enhancement strategy, PLESS, that leverages hierarchical image partitioning to improve the reliability and spatial consistency of pseudo-labels in weakly supervised segmentation.
The paper introduces WaveFormer, a transformer architecture tailored for biomedical signal classification, addressing limitations of standard transformers in capturing multi-scale frequency patterns in long sequences. WaveFormer incorporates wavelet decomposition in both the embedding construction via multi-channel DWT and positional encoding via Dynamic Wavelet Positional Encoding (DyWPE). Experiments across eight datasets for human activity recognition and brain signal analysis demonstrate WaveFormer's competitive performance by effectively integrating frequency-domain information.
Introduces a novel transformer architecture, WaveFormer, that integrates wavelet decomposition into both the embedding and positional encoding stages to improve biomedical signal classification.
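The kind of frequency-separated, multi-channel input a DWT-based embedding consumes can be shown with a single level of the Haar wavelet transform. This is a generic illustration of discrete wavelet decomposition, not WaveFormer's actual embedding pipeline (which uses multi-channel DWT inside the model).

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar discrete wavelet transform: splits a signal
    into a low-frequency approximation channel and a high-frequency
    detail channel, each at half the original length."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def haar_idwt(approx, detail):
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 8 * np.pi, 64)) + 0.1 * rng.standard_normal(64)
a, d = haar_dwt(x)
# Stacking (approx, detail) per position yields the kind of multi-channel,
# frequency-separated representation a DWT-based embedding could consume.
emb = np.stack([a, d], axis=-1)            # shape (32, 2)
assert emb.shape == (32, 2)
assert np.allclose(haar_idwt(a, d), x)     # the transform is lossless
```

Repeating the decomposition on the approximation channel gives the multi-scale hierarchy that motivates wavelet-based embeddings for long biomedical signals.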
This paper introduces FAST, a humanoid whole-body control framework designed for fast adaptation and stable motion tracking. FAST employs Parseval-Guided Residual Policy Adaptation, learning a lightweight delta action policy with orthogonality and KL constraints for efficient adaptation to new motions. The framework also incorporates Center-of-Mass-Aware Control, enhancing balance by integrating CoM-related observations and objectives.
Introduces Parseval-Guided Residual Policy Adaptation, a novel method for efficiently adapting humanoid control policies to new motions by learning a lightweight delta action policy under orthogonality and KL constraints.
This paper addresses performance degradation in federated learning (FL) due to data heterogeneity and variable participation frequencies among nodes. The authors introduce PMFL, a model-contrastive FL framework that incorporates historical training information to improve model consistency and reduce performance fluctuations. Extensive experiments demonstrate that PMFL outperforms existing FL methods in heterogeneous scenarios.
Introduces a model-contrastive federated learning framework (PMFL) that leverages historical local and global models to improve performance in heterogeneous federated learning scenarios.
This paper introduces a lightweight framework for predicting LLM output length by reusing the main model's internal hidden states, addressing the computational waste caused by excessive padding in batched inference. The framework consists of Entropy-Guided Token Pooling (EGTP) for static prediction and Progressive Length Prediction (PLP) for dynamic estimation during stochastic generation. Experiments on the newly introduced ForeLen benchmark demonstrate that EGTP achieves state-of-the-art accuracy, reducing MAE by 29.16% compared to existing methods, and improves end-to-end throughput when integrated with a length-aware scheduler.
Proposes a novel and efficient framework for LLM output length prediction that leverages entropy-guided token pooling and progressive length prediction to improve accuracy and reduce computational overhead.
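One plausible shape for entropy-guided pooling is a weighted mean of the reused hidden states, with weights derived from each position's next-token entropy. This is our guess at the mechanism for illustration only; EGTP's actual pooling may differ.

```python
import numpy as np

def entropy_guided_pooling(hidden, probs):
    """Pool per-token hidden states with weights from next-token entropy.

    Hypothetical sketch: tokens where the model's next-token distribution
    is more uncertain get more weight in the pooled summary that a
    length predictor would consume.
    """
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=-1)   # per-token entropy
    w = ent / ent.sum()                                    # normalize to a mean
    return w @ hidden                                      # weighted pooling

rng = np.random.default_rng(0)
T, d, V = 5, 8, 50
hidden = rng.standard_normal((T, d))                       # reused hidden states
logits = rng.standard_normal((T, V))
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

pooled = entropy_guided_pooling(hidden, probs)
assert pooled.shape == (d,)
```

Because the hidden states and next-token distributions are byproducts of normal decoding, a predictor fed this pooled vector adds almost no extra compute, which matches the paper's "lightweight" framing.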
The paper introduces SParse Expert Synchronization (SPES), a decentralized training framework for Mixture-of-Experts (MoE) LLMs that reduces memory footprint by training only a subset of experts per node and periodically synchronizing them. This approach addresses the GPU memory limitations of existing decentralized training methods, which still require training the entire model on each node. The authors demonstrate that SPES enables training of 2B, 7B, and 9B parameter MoE models on resource-constrained hardware, achieving performance comparable to centrally trained LLMs with similar computational budgets.
Introduces SParse Expert Synchronization (SPES), a memory-efficient decentralized training framework that enables pretraining large MoE language models on distributed GPUs with limited memory.
The paper introduces LUVE, a latent-cascaded framework for ultra-high-resolution (UHR) video generation that tackles challenges in motion modeling, semantic planning, and detail synthesis. LUVE uses a three-stage architecture: low-resolution motion generation, latent upsampling, and high-resolution content refinement with dual frequency experts. Experiments demonstrate that LUVE achieves superior photorealism and content fidelity in UHR video generation compared to existing methods.
Introduces a novel latent-cascaded architecture with dual-frequency experts for generating ultra-high-resolution videos, improving both photorealism and content fidelity.
The paper introduces Variance Minimisation Policy Optimisation (VMPO) for diffusion alignment, framing the process as Sequential Monte Carlo and minimizing the variance of log importance weights instead of using a KL divergence objective. This approach is motivated by the SMC interpretation of diffusion alignment where the denoising model acts as a proposal and reward guidance induces importance weights. The authors demonstrate that minimizing the variance objective leads to the reward-tilted target distribution and recovers existing KL-based alignment methods under specific conditions, while also suggesting novel alignment strategies.
Introduces Variance Minimisation Policy Optimisation (VMPO) as a novel objective for diffusion alignment, minimizing the variance of log importance weights within an SMC framework.
The paper introduces Categorical Flow Maps, a flow-matching method designed for fast, few-step generation of categorical data using self-distillation. By defining a continuous flow map towards the simplex, the method transports probability mass to a predicted endpoint, enabling the use of distillation techniques and a novel endpoint consistency objective. Experiments demonstrate state-of-the-art few-step generation performance across images, molecular graphs, and text, even achieving strong results in single-step generation.
Introduces a continuous flow-matching formulation for categorical data generation that enables self-distillation and endpoint consistency training, leading to accelerated sampling.
The paper introduces a novel approach for irregular time series modeling by replacing Neural ODEs with a linear damped harmonic oscillator analogy that admits a closed-form solution, thereby avoiding computationally expensive numerical solvers. Keys and values are modeled as damped, driven oscillators, and the query is expanded in a sinusoidal basis, with attention modeled as a resonance phenomenon. The method is proven to maintain the universal approximation property of continuous-time attention and achieves state-of-the-art performance on irregular time series benchmarks with significant speedups.
Introduces a computationally efficient irregular time series model based on damped harmonic oscillators with closed-form solutions, demonstrating state-of-the-art performance and theoretical guarantees.
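The core computational advantage is that the underdamped oscillator has a closed-form solution, so the state can be evaluated directly at arbitrary, irregularly spaced timestamps with no numerical ODE solver. A minimal sketch of that closed form (the paper's full model layers keys, values, and resonance-based attention on top of it):

```python
import numpy as np

def damped_oscillator(t, x0, v0, gamma, omega0):
    """Closed-form solution of x'' + 2*gamma*x' + omega0^2 * x = 0 in the
    underdamped regime (gamma < omega0), evaluated at arbitrary, possibly
    irregular times t -- no numerical solver required."""
    omega = np.sqrt(omega0**2 - gamma**2)          # damped frequency
    A = x0
    B = (v0 + gamma * x0) / omega
    return np.exp(-gamma * t) * (A * np.cos(omega * t) + B * np.sin(omega * t))

# Irregularly spaced observation times, as in irregular time series.
t = np.array([0.0, 0.13, 0.9, 1.02, 3.7])
x = damped_oscillator(t, x0=1.0, v0=0.0, gamma=0.2, omega0=2.0)

assert np.isclose(x[0], 1.0)                       # matches initial condition
assert abs(x[-1]) < 1.0                            # envelope decays over time
```

Evaluating all timestamps is a single vectorized expression, which is where the reported speedups over solver-based Neural ODE pipelines come from.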
This paper introduces an ML-driven physical synthesis framework for RF circuits that addresses limitations of prior ML approaches by incorporating EM-accurate component models and routing capabilities. They trained a neural network on a large dataset of inductor geometries to predict Q-factor with high accuracy, enabling gradient-based layout optimization. The framework integrates a P-Cell optimizer and a placement/routing engine with EM spacing rules, resulting in DRC-aware GDSII layouts.
Introduces an end-to-end ML-driven framework for RF physical synthesis that generates manufacturable GDSII layouts by integrating EM-aware neural inductor modeling with intelligent placement and routing.
The paper introduces SparrowRL, a novel RL training system designed to overcome bandwidth limitations in commodity-networked GPU resources by exploiting the sparsity of per-step updates during RL fine-tuning. SparrowRL achieves this by representing updates as sparse delta checkpoints, pipelining delta extraction with multi-stream transmission, overlapping transfer with rollout generation, and employing throughput- and bandwidth-aware scheduling. Experiments on Qwen3 models show SparrowRL reduces per-step transfer payload by 79x and improves throughput by 2.4-9.5x over full-weight broadcast across WAN, achieving comparable throughput to RDMA clusters with improved cost efficiency.
Introduces SparrowRL, a system that enables efficient RL training over commodity networks by leveraging sparse delta checkpoints and bandwidth-aware scheduling to minimize communication overhead.
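The sparse delta checkpoint idea can be sketched in a few lines: transmit only the indices and values of weights that changed in a step, then patch the receiver's copy. This is a simplified illustration; SparrowRL additionally pipelines extraction, uses multi-stream transmission, and schedules transfers, none of which is shown here.

```python
import numpy as np

def make_delta(prev, curr, eps=1e-8):
    """Encode a weight update as a sparse delta: indices and values of
    the entries that actually changed (per-step RL updates touch few)."""
    delta = curr - prev
    idx = np.flatnonzero(np.abs(delta) > eps)
    return idx, delta.flat[idx]

def apply_delta(prev, idx, vals):
    out = prev.copy()
    out.flat[idx] += vals
    return out

rng = np.random.default_rng(0)
w_old = rng.standard_normal(10_000)
w_new = w_old.copy()
touched = rng.choice(10_000, size=120, replace=False)    # a sparse update
w_new[touched] += 0.01 * rng.standard_normal(120)

idx, vals = make_delta(w_old, w_new)
payload = idx.nbytes + vals.nbytes                       # what gets sent
assert payload < w_new.nbytes                            # far below full weights
assert np.allclose(apply_delta(w_old, idx, vals), w_new) # receiver reconstructs
```

With 120 of 10,000 entries touched, the payload is roughly 2% of a full-weight broadcast, which is the mechanism behind the reported 79x payload reduction at much larger scale.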
This paper addresses the computational bottleneck introduced by post-quantum cryptography (PQC) in Open Radio Access Networks (O-RAN) control planes, which impacts energy efficiency. The authors propose an energy-aware framework with a Crypto Policy rApp and a Security Operations Scheduling (SOS) xApp to strategically manage PQC suites and optimize cryptographic enforcement timing and placement. Through discrete-event simulation, the proposed scheduling approach achieves a 60% reduction in per-handshake energy consumption without compromising slice latency targets.
Introduces an energy-aware scheduling framework for PQC handshakes in O-RAN that minimizes energy consumption while meeting slice latency requirements.
This paper introduces novel learning dynamics for games that achieve fast convergence without requiring prior knowledge of the utility scale. For two-player zero-sum games, the authors develop scale-free and scale-invariant dynamics with $\tilde{O}(A_{\mathrm{diff}})$ external regret, while for multiplayer general-sum games, they achieve $O(U_{\mathrm{max}} \log T)$ swap regret. These dynamics are based on optimistic follow-the-regularized-leader with an adaptive learning rate and a new stopping-time analysis, along with a doubling clipping technique for general-sum games.
Develops scale-free and scale-invariant learning dynamics for both zero-sum and general-sum games that achieve fast convergence rates without requiring prior knowledge of the utility scale.
The paper introduces EqDeepRx, a deep-learning-aided MIMO receiver that combines linear processing with learned components for improved scaling and generalization. EqDeepRx employs a shared-weight DetectorNN operating on individual spatial streams to achieve near-linear complexity scaling with multiplexing order, and uses a DenoiseNN to enhance channel estimation. End-to-end simulations demonstrate that EqDeepRx achieves improved error rate and spectral efficiency compared to conventional receivers while maintaining low complexity and supporting various MIMO configurations without retraining.
Introduces a novel deep-learning-aided MIMO receiver architecture, EqDeepRx, that achieves near-linear complexity scaling with multiplexing order through a shared-weight DetectorNN and enhances generalization via a DenoiseNN.
This paper compares MAP and LMMSE estimators for blind deconvolution problems, focusing on scenarios with full knowledge of signal and kernel distributions. It finds that MAP estimators are unstable and require extensive tuning, even in controlled settings, while LMMSE provides a robust baseline. The study also demonstrates that LMMSE solutions can effectively initialize MAP methods, improving their performance and stability.
Empirically demonstrates the instability of MAP estimators compared to LMMSE in blind deconvolution and shows that LMMSE can effectively initialize MAP methods.
The paper introduces PPTAM$\eta$, a CI/CD pipeline integrated with GitLab CI, designed to measure the energy consumption of containerized API systems during rapid deployment cycles. It addresses the gap in current CI/CD practices by incorporating power and energy measurement, revealing the impact of code changes on energy efficiency. The evaluation on a JWT-authenticated API demonstrates the pipeline's ability to collect performance and energy metrics across different commits, enabling version comparison and trend analysis.
Introduces an automated CI/CD pipeline, PPTAM$\eta$, that integrates power and energy measurement into GitLab CI for containerized API systems, enabling energy-aware development.
The paper introduces U-Former ODE (UFO), a novel architecture for probabilistic forecasting of irregular time series data that combines U-Nets, Transformers, and Neural CDEs. UFO enables parallelizable computation and global receptive fields, addressing the scalability limitations of existing Neural CDE approaches. Experiments on five benchmarks demonstrate that UFO outperforms ten state-of-the-art baselines in predictive accuracy and achieves up to 15x faster inference, particularly on long and multivariate sequences.
Introduces a fully causal, parallelizable architecture, U-Former ODE (UFO), that integrates U-Nets, Transformers, and Neural CDEs for efficient and accurate probabilistic forecasting of irregular time series.
The paper introduces Trans-Chunk BiMamba (TC-BiMamba), a novel architecture for unified streaming and non-streaming automatic speech recognition (ASR) that addresses the limitations of existing BiMamba-based streaming methods which are restricted to fixed chunk sizes. TC-BiMamba employs a trans-chunk mechanism to train bidirectional sequences offline with dynamic chunk sizes, enabling a single model to handle both offline and streaming decoding with varying latency requirements. Experiments demonstrate that TC-BiMamba achieves a 1.3x training speedup, reduces memory consumption by 50%, and improves ASR performance compared to chunk-wise processing, while also outperforming U2++ and matching LC-BiMamba with a smaller model size.
Introduces the Trans-Chunk BiMamba (TC-BiMamba) architecture, enabling efficient dynamic chunk size training for unified streaming and non-streaming ASR.
This paper introduces a technical curriculum designed to enhance AI literacy within the language and translation (L&T) industry, covering vector embeddings, neural networks, tokenization, and transformer networks. The curriculum aims to cultivate computational thinking, algorithmic awareness, and agency among L&T professionals to improve their digital resilience. Evaluation in an MA course at TH Koeln suggests the curriculum's effectiveness, while also highlighting the need for additional lecturer support to maximize learning outcomes.
Proposes and evaluates a technical curriculum focused on language-oriented AI to improve AI literacy and digital resilience in the language and translation industry.
The paper analyzes Langevin dynamics with noise projected onto directions orthogonal to an isometric group action, a model relevant to understanding symmetry effects in stochastic gradient descent for over-parameterized models. The key finding is that when initial and target densities are group-invariant, this projected Langevin dynamics is equivalent in law to standard Langevin dynamics with isotropic diffusion but with an additional drift term related to the negative log volume of the group orbit. This equivalence is proven through a coupling argument involving a third process on the group, identifying the drift as the mean curvature of the orbits, thus revealing a novel form of implicit regularization.
Establishes an equivalence between Langevin dynamics with projected noise and standard Langevin dynamics with an additional drift term proportional to the negative log volume of the group orbit, revealing a novel form of implicit regularization.
This paper introduces an energy-aware spike budgeting framework for continual learning in spiking neural networks (SNNs) to address catastrophic forgetting while optimizing for energy efficiency. The framework combines experience replay, learnable LIF neuron parameters, and an adaptive spike scheduler to enforce dataset-specific energy constraints during training. Results show that spike budgeting acts as a sparsity-inducing regularizer on frame-based datasets, improving accuracy and reducing spike rates, while controlled budget relaxation enables accuracy gains on event-based datasets.
Introduces an energy-aware spike budgeting framework that adaptively controls spike rates during continual learning in SNNs to improve both accuracy and energy efficiency across frame-based and event-based neuromorphic vision datasets.
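A budget-enforcement loop of this kind can be sketched as an adaptive penalty controller: the multiplier on a spike-rate regularizer rises while the network spikes above budget and falls when it is under budget. The controller and the toy rate model below are our illustration of the idea, not the paper's scheduler.

```python
# Adaptive penalty update: push the regularizer weight up when the
# observed spike rate exceeds the budget, down when it is below it.
def update_penalty(lmbda, spike_rate, budget, eta=0.5, lmbda_min=0.0):
    return max(lmbda_min, lmbda + eta * (spike_rate - budget))

# Toy dynamics: assume the achieved spike rate falls as the penalty grows
# (a real SNN would reach this through training with the penalized loss).
def toy_spike_rate(lmbda, base=0.30):
    return base / (1.0 + lmbda)

budget, lmbda = 0.10, 0.0
for _ in range(200):
    rate = toy_spike_rate(lmbda)
    lmbda = update_penalty(lmbda, rate, budget)

# The loop settles near the dataset-specific budget.
assert abs(toy_spike_rate(lmbda) - budget) < 0.01
```

Framed this way, the budget acts as the sparsity-inducing regularizer the paper observes on frame-based data, while raising the budget ("controlled relaxation") trades spikes back for accuracy on event-based data.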
This paper investigates the relationship between performance antipatterns and energy consumption in microservice architectures by implementing ten common antipatterns as isolated microservices and measuring their performance, CPU/DRAM power consumption, and resource utilization under controlled load. The study reveals that while all implemented antipatterns degrade performance, only a subset significantly increase power consumption, with some reaching CPU saturation and others exhibiting energy-performance coupling. The findings provide a basis for identifying performance antipatterns that also act as energy antipatterns, offering insights for energy-efficient microservice design.
Empirically demonstrates that not all performance antipatterns in microservices lead to increased power consumption, identifying specific cases where performance degradation does not correlate with higher energy usage due to CPU saturation effects.
This paper explores the use of Mamba-2 hybrid operators within Tiny Recursive Models (TRM) for abstract reasoning, motivated by Mamba-2's inherent iterative refinement properties. By replacing Transformer blocks in TRM with Mamba-2 hybrids while maintaining parameter parity, the authors demonstrate improved performance on the ARC-AGI-1 benchmark. Specifically, the Mamba-2 hybrid TRM achieves a +2.0% improvement in pass@2 and a +4.75% improvement in pass@100, suggesting enhanced candidate coverage.
Demonstrates that Mamba-2 hybrid operators can effectively replace Transformer blocks within Tiny Recursive Models, leading to improved performance on abstract reasoning tasks.
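The recursion pattern being exploited is independent of the operator choice. A toy sketch of a tiny-recursive-model-style refinement loop, with a stand-in contraction operator where the paper uses a Transformer block or a Mamba-2 hybrid (the operator and step count are illustrative assumptions):

```python
import numpy as np

def refine(z, x, steps, operator):
    """TRM-style loop: the same small operator is applied
    repeatedly to refine a latent answer z given input x."""
    for _ in range(steps):
        z = operator(z, x)
    return z

# Stand-in operator: moves z halfway toward a target encoded in x.
# It is a contraction, so more recursion steps improve the answer,
# mirroring the iterative-refinement property attributed to Mamba-2.
op = lambda z, x: z + 0.5 * (x - z)

x = np.array([1.0, -2.0, 3.0])
z0 = np.zeros(3)
z8 = refine(z0, x, steps=8, operator=op)
```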
This paper analyzes the Muon optimizer on simple strongly convex quadratic functions to understand its empirical success in large-scale training. It demonstrates that existing explanations based on single-step comparisons and worst-case guarantees are insufficient to explain Muon's behavior. The analysis reveals that approximation errors in the polar step and structural properties of the objective function significantly impact Muon's performance, suggesting the need for more nuanced theoretical frameworks.
Demonstrates that approximation errors in the polar step and structural properties of the objective function significantly impact Muon's performance on simple quadratics, challenging existing theoretical explanations.
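To make the object of study concrete, here is a minimal sketch of a Muon-style step on a quadratic, with the polar factor computed exactly via SVD. Muon itself approximates this step with a Newton–Schulz iteration, and the paper's point is that the resulting approximation error matters; the learning rate, momentum, and test quadratic below are illustrative choices:

```python
import numpy as np

def msign(G):
    # Exact polar factor via SVD. Muon replaces this with a
    # Newton-Schulz iteration; that approximation error is one of
    # the effects the paper analyzes.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def muon_step(W, grad, M, lr=0.1, beta=0.9):
    # Momentum accumulation followed by an orthogonalized update.
    M = beta * M + grad
    return W - lr * msign(M), M

# Strongly convex quadratic f(W) = 0.5 * ||W - W_star||_F^2.
W_star = 3.0 * np.eye(4)
W, M = np.zeros((4, 4)), np.zeros((4, 4))
for _ in range(20):
    W, M = muon_step(W, W - W_star, M)
loss = 0.5 * np.sum((W - W_star) ** 2)
```

Swapping `msign` for a few Newton–Schulz iterations reproduces the approximate polar step whose interaction with the quadratic's structure the analysis examines.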
The paper introduces Meta-Sel, a supervised meta-learning approach for efficient demonstration selection in in-context learning, which addresses the challenge of selecting optimal few-shot examples under a limited prompt budget. Meta-Sel learns a scoring function based on TF-IDF cosine similarity and a length-compatibility ratio between candidate demonstrations and queries, trained on a meta-dataset constructed from training data using class agreement as supervision. Empirical evaluation across four intent datasets and five LLMs demonstrates that Meta-Sel achieves competitive accuracy with low selection-time overhead relative to 12 other demonstration selection methods, especially benefiting smaller models.
Introduces Meta-Sel, a lightweight supervised meta-learning approach that learns a fast, interpretable scoring function for selecting demonstrations for in-context learning.
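The two signals feeding the scoring function can be sketched with plain term statistics. The exact TF-IDF weighting, the definition of the length-compatibility ratio, and how the two are combined are assumptions here rather than the paper's specification:

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    # Plain TF-IDF over whitespace tokens (illustrative weighting).
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(w for d in docs for w in d)
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    return [{w: c * idf[w] for w, c in d.items()} for d in docs]

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def score(demo_vec, query_vec, demo_len, query_len):
    # Similarity damped by a length-compatibility ratio (assumed
    # here to be shorter/longer length; the paper may differ).
    ratio = min(demo_len, query_len) / max(demo_len, query_len)
    return cosine(demo_vec, query_vec) * ratio

demos = ["book a flight to paris", "play some jazz music"]
query = "reserve a flight ticket"
vecs = tfidf_vectors(demos + [query])
scores = [score(vecs[i], vecs[2], len(demos[i].split()),
                len(query.split())) for i in range(2)]
```

Because scoring reduces to sparse dot products, selection cost stays far below methods that embed or generate with an LLM per candidate, which is the overhead advantage the evaluation reports.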
The paper addresses the problem of excessive and unnecessary reflection in Large Reasoning Models (LRMs) that leads to increased token consumption and computational overhead without improving accuracy, especially in smaller models. To mitigate this, the authors propose Adaptive Reflection and Length Coordinated Penalty (ARLCP), a reinforcement learning framework that dynamically balances reasoning efficiency and solution accuracy by introducing reflection and length penalties. Experiments on mathematical reasoning benchmarks using DeepSeek-R1-Distill-Qwen-1.5B and 7B models demonstrate that ARLCP achieves a superior efficiency-accuracy trade-off, reducing response length by up to 53.1% while improving accuracy by up to 5.8%.
Introduces ARLCP, a novel reinforcement learning framework with adaptive reflection and length penalties, to train LRMs for efficient reasoning by curtailing unnecessary reflective steps while preserving essential reasoning.
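The reward shaping can be sketched as a correctness term minus penalties for reflective phrases and excess length. The marker list, coefficients, and functional form below are illustrative assumptions, not ARLCP's exact (adaptive) reward:

```python
REFLECTION_MARKERS = ("wait", "let me re-check", "on second thought")

def shaped_reward(correct, response, target_len,
                  w_reflect=0.05, w_len=0.001):
    """Reward = accuracy term minus penalties for reflective
    phrases and for exceeding a target length (assumed form)."""
    text = response.lower()
    n_reflect = sum(text.count(m) for m in REFLECTION_MARKERS)
    over_len = max(0, len(response.split()) - target_len)
    return float(correct) - w_reflect * n_reflect - w_len * over_len

concise = "the answer is 42"
reflective = "wait the answer might be 7 wait no the answer is 42"
r_concise = shaped_reward(True, concise, target_len=50)
r_reflective = shaped_reward(True, reflective, target_len=50)
```

The "coordinated" aspect of ARLCP, adapting these penalty weights during training so that essential reasoning is preserved, is omitted from this sketch.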
The paper introduces Composition-RL, a method to improve reinforcement learning of LLMs by composing multiple verifiable prompts into a single, more complex prompt, addressing the issue of diminishing returns from easy (pass-rate-1) prompts as training progresses. This approach aims to better utilize limited verifiable prompts by creating new training examples that maintain a high pass rate while increasing complexity. Experiments on models ranging from 4B to 30B parameters demonstrate that Composition-RL enhances reasoning capabilities and enables more effective cross-domain RL when combined with a curriculum learning strategy that gradually increases compositional depth.
Introduces Composition-RL, a novel method that composes multiple verifiable prompts to create more complex training examples for reinforcement learning of LLMs, thereby improving reasoning capabilities and cross-domain generalization.
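Composing verifiable prompts can be sketched as concatenating sub-prompts and requiring the verifier to accept every sub-answer. The prompt format and the all-must-pass rule are assumptions about how the composition works:

```python
def compose(prompts):
    """Join k single-question prompts into one multi-part prompt."""
    parts = [f"({i + 1}) {p}" for i, p in enumerate(prompts)]
    return "Answer all parts. " + " ".join(parts)

def composed_verifier(verifiers, answers):
    """The composed prompt counts as solved only if every
    sub-answer passes its own verifier (all-must-pass assumption)."""
    return all(v(a) for v, a in zip(verifiers, answers))

is_four = lambda a: a.strip() == "4"
is_paris = lambda a: a.strip().lower() == "paris"
prompt = compose(["What is 2 + 2?", "What is the capital of France?"])
ok = composed_verifier([is_four, is_paris], ["4", "Paris"])
bad = composed_verifier([is_four, is_paris], ["5", "Paris"])
```

Under an independence assumption, the composition's pass rate is roughly the product of the sub-prompt pass rates, which is how composing easy pass-rate-1 prompts yields harder yet still verifiable training examples.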
This paper investigates the effectiveness of using small language models (SLMs) as judges to improve code generation, particularly in scenarios where large language models (LLMs) may underperform. The authors train and evaluate several state-of-the-art SLMs to discriminate between correct and incorrect code implementations, focusing on classification accuracy. Results demonstrate that modern SLMs, even without execution-based information, outperform previous approaches and achieve comparable performance to much larger LLMs when used as code rankers, offering a cost-effective alternative for code generation.
Demonstrates that modern small language models can effectively serve as code correctness judges and rankers, achieving performance competitive with much larger language models at a significantly reduced cost.
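The judge-as-ranker usage reduces to best-of-n selection over candidate programs. A minimal sketch, where the judge is a stand-in stub for the trained SLM classifier the paper evaluates:

```python
def rank_candidates(candidates, judge):
    """Best-of-n reranking: score each candidate program with the
    judge's estimated probability of correctness, keep the top one."""
    return max(candidates, key=judge)

# Stand-in judge (a trained SLM classifier in the paper's setup):
# this toy version just prefers candidates defining the expected
# function name, purely for illustration.
toy_judge = lambda code: float("def add(" in code)

cands = ["def subtract(a, b): return a - b",
         "def add(a, b): return a + b"]
best = rank_candidates(cands, toy_judge)
```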
The paper introduces Empirical Gaussian Processes (GPs), a framework for constructing data-driven GP priors by empirically estimating the mean and covariance functions from historical observations. This approach overcomes limitations of handcrafted kernels, enabling the prior to reflect complex covariance structures present in the data. The authors derive an Expectation-Maximization algorithm with closed-form updates for learning the GP prior from independent datasets with heterogeneous observation locations, and demonstrate competitive performance on learning curve extrapolation and time series forecasting.
Introduces Empirical GPs, a novel method for learning GP priors directly from data by estimating the mean and covariance functions, thereby improving adaptability and reducing reliance on expert-defined kernels.
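The core construction, estimating the GP mean and covariance from historical curves and then conditioning as usual, can be sketched on a shared grid. The paper's EM algorithm additionally handles heterogeneous observation locations; this sketch assumes a common grid, an arbitrary synthetic curve family, and a small jitter for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 10)

# Historical curves drawn from some unknown family.
curves = np.stack([rng.normal(1.0, 0.3) * np.sin(2 * np.pi * t)
                   + rng.normal(0.0, 0.3) for _ in range(200)])

# Empirical GP prior: sample mean and sample covariance plus jitter.
mu = curves.mean(axis=0)
K = np.cov(curves, rowvar=False) + 1e-6 * np.eye(len(t))

# Condition on the first 3 points of a new curve from the family
# and extrapolate the rest (standard GP posterior mean).
y_new = 1.4 * np.sin(2 * np.pi * t) - 0.2
obs, rest = slice(0, 3), slice(3, None)
solve = np.linalg.solve(K[obs, obs], y_new[obs] - mu[obs])
post_mean = mu[rest] + K[rest, obs] @ solve

prior_rmse = np.sqrt(np.mean((mu[rest] - y_new[rest]) ** 2))
post_rmse = np.sqrt(np.mean((post_mean - y_new[rest]) ** 2))
```

Because the empirical covariance captures the family's structure directly, the posterior extrapolation beats the prior mean without any handcrafted kernel, which is the learning-curve-extrapolation use case the paper targets.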
This paper introduces a lightweight RGB-D fusion framework to improve the efficiency and accuracy of Segment Anything Models (SAM). The authors augment EfficientViT-SAM with monocular depth priors generated by a pretrained estimator, fusing depth information mid-level with RGB features using a dedicated depth encoder. Training on only 11.2k samples, the proposed method outperforms EfficientViT-SAM, demonstrating the effectiveness of depth cues as geometric priors for segmentation.
Introduces a depth-aware fusion mechanism to enhance EfficientViT-SAM, enabling superior segmentation performance with significantly reduced training data.
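Mid-level fusion of the kind described can be sketched as projecting depth-encoder features into the RGB channel width and combining them at an intermediate stage. The shapes, the linear projection, and the additive fusion rule are assumptions about the design, not the paper's exact architecture:

```python
import numpy as np

def fuse_mid_level(rgb_feat, depth_feat, W_proj):
    """Project depth features to the RGB channel width and fuse
    additively at a mid-level stage (assumed fusion rule)."""
    return rgb_feat + depth_feat @ W_proj

# Toy shapes: 64 spatial tokens, 32 RGB channels, 16 depth channels.
rng = np.random.default_rng(0)
rgb = rng.normal(size=(64, 32))      # mid-level RGB features
depth = rng.normal(size=(64, 16))    # depth-encoder features
W = 0.1 * rng.normal(size=(16, 32))  # learned projection (random here)
fused = fuse_mid_level(rgb, depth, W)
```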
This paper addresses the sample inefficiency of off-policy reinforcement learning by constraining the initial representations of input data to alleviate distribution shift. The authors introduce a novel framework, CIR, incorporating a Tanh activation function in the initial layer, normalization techniques, skip connections, and convex Q-learning. Theoretical analysis demonstrates the convergence of temporal difference learning with the Tanh function under linear function approximation, and empirical results show CIR achieves strong performance on continuous control tasks.
Introduces a Constrained Initial Representations (CIR) framework that improves off-policy RL sample efficiency by constraining initial representations using a Tanh activation, normalization, skip connections, and convex Q-learning.
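The constrained first stage can be sketched directly: a Tanh initial layer bounds the representation, followed by a normalized block with a skip connection. Layer sizes and the exact composition are assumptions, and the convex Q-learning component is omitted:

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    mu, var = h.mean(-1, keepdims=True), h.var(-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def cir_encoder(x, W_in, W_hid):
    # Tanh on the very first layer bounds the initial representation
    # in (-1, 1), limiting distribution shift of downstream inputs.
    z0 = np.tanh(x @ W_in)
    # Subsequent block with normalization and a skip connection.
    return z0 + layer_norm(np.maximum(z0 @ W_hid, 0.0))

rng = np.random.default_rng(0)
x = 10.0 * rng.normal(size=(4, 8))      # deliberately wide inputs
W_in = rng.normal(size=(8, 16))
W_hid = 0.1 * rng.normal(size=(16, 16))
z0 = np.tanh(x @ W_in)
out = cir_encoder(x, W_in, W_hid)
```

Even for the wide inputs above, every downstream layer sees bounded activations, which is the distribution-shift argument the framework rests on.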

