This paper addresses the sample inefficiency of off-policy reinforcement learning by constraining the initial representations of input data to alleviate distribution shift. The authors introduce a novel framework, CIR, which incorporates a Tanh activation function in the initial layer, normalization techniques, skip connections, and convex Q-learning. Theoretical analysis establishes the convergence of temporal difference learning with the Tanh function under linear function approximation, and empirical results show that CIR achieves strong performance on continuous control tasks.
Constraining initial state representations with a simple Tanh activation and skip connections can significantly boost off-policy RL performance, rivaling more complex methods on continuous control tasks.
Recently, there have been numerous attempts to enhance the sample efficiency of off-policy reinforcement learning (RL) agents when interacting with the environment, including architectural improvements and new algorithms. Despite these advances, existing approaches overlook the potential of directly constraining the initial representations of the input data, which can intuitively alleviate the distribution shift issue and stabilize training. In this paper, we introduce the Tanh function into the initial layer to fulfill such a constraint. We theoretically characterize the convergence of temporal difference learning with the Tanh function under linear function approximation. Motivated by these theoretical insights, we present our Constrained Initial Representations framework, tagged CIR, which comprises three components: (i) the Tanh activation along with normalization methods to stabilize representations; (ii) a skip connection module that provides a linear pathway from the shallow layer to the deep layer; (iii) convex Q-learning, which allows a more flexible value estimate and mitigates potential conservatism. Empirical results show that CIR exhibits strong performance on numerous continuous control tasks, matching or surpassing existing strong baseline methods.
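The abstract does not include an implementation, but the first two architectural components can be sketched concretely. The following is a minimal, hypothetical numpy sketch (names like `ConstrainedEncoder` and the layer sizes are our own assumptions, not the authors' code): the first-layer output is layer-normalized and passed through Tanh so the initial representation is bounded in (-1, 1), and a separate linear projection of the input is added to the deeper layer's output as a skip connection.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each feature vector to zero mean / unit variance,
    # which stabilizes the scale of the initial representation.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

class ConstrainedEncoder:
    """Hypothetical sketch of a CIR-style encoder (an illustration, not the
    paper's implementation): Tanh constrains the initial representation to a
    bounded range, and a linear skip pathway connects input to deep layer."""

    def __init__(self, in_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden_dim))
        self.W2 = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        self.W_skip = rng.normal(0.0, 0.1, (in_dim, hidden_dim))  # linear pathway

    def __call__(self, x):
        # (i) constrained initial representation: normalization + bounded Tanh
        h0 = np.tanh(layer_norm(x @ self.W1))
        # deeper nonlinear layer (ReLU chosen here for illustration)
        h1 = np.maximum(h0 @ self.W2, 0.0)
        # (ii) skip connection: linear path from the input to the deep layer
        return h1 + x @ self.W_skip

enc = ConstrainedEncoder(in_dim=4, hidden_dim=8)
z = enc(np.ones((2, 4)))  # batch of 2 states -> shape (2, 8)
```

Because every entry of the initial representation lies strictly inside (-1, 1) regardless of the input scale, the features seen by deeper layers stay in a fixed range even as the data distribution shifts during off-policy training, which is the intuition the abstract appeals to.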