The paper addresses the performance degradation that occurs when offline-trained actor-critics are fine-tuned online with value-based RL algorithms. The authors hypothesize that offline and online optima are separated by low-performance valleys in the loss landscape. To mitigate this, they introduce Score Matched Actor-Critic (SMAC), which regularizes the Q-function during offline training so that the score of the policy matches the action-gradient of the Q-function. SMAC transfers smoothly to Soft Actor-Critic (SAC) and TD3 in 6/6 D4RL tasks, and in 4/6 environments it reduces regret by 34-58% over the best baseline.
No more performance cliffs: SMAC lets you smoothly fine-tune offline RL actor-critics online by aligning the policy's score with the action-gradient of the Q-function.
Modern offline Reinforcement Learning (RL) methods find performant actor-critics; however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and the action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34-58% over the best baseline.
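To make the "first-order derivative equality" concrete, here is a minimal sketch of what such a penalty could look like. It is not the paper's loss: the abstract does not state the exact regularizer, so the sketch assumes the maximum-entropy form of the equality, grad_a log pi(a|s) = grad_a Q(s,a) / alpha, with a hypothetical temperature `alpha`. It uses a one-dimensional Gaussian policy at a single state and finite differences in place of automatic differentiation. All function names are illustrative.

```python
import math

def action_grad(f, a, eps=1e-5):
    """Central finite-difference derivative of a scalar function f at action a."""
    return (f(a + eps) - f(a - eps)) / (2.0 * eps)

def gaussian_log_prob(a, mu, sigma):
    """Log-density of a 1-D Gaussian policy pi(a|s) = N(mu, sigma^2)."""
    return (-0.5 * ((a - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))

def score_matching_penalty(log_pi, q, actions, alpha=1.0):
    """Mean squared gap between the policy score and grad_a Q / alpha.

    Assumed (not from the paper): the equality being enforced is
    grad_a log pi(a|s) = grad_a Q(s, a) / alpha.
    """
    gaps = [action_grad(log_pi, a) - action_grad(q, a) / alpha for a in actions]
    return sum(g * g for g in gaps) / len(gaps)

# Hypothetical policy parameters and temperature for a single fixed state.
mu, sigma, alpha = 0.3, 0.5, 0.2
log_pi = lambda a: gaussian_log_prob(a, mu, sigma)

# A critic equal to alpha * log pi up to a constant satisfies the equality.
q_matched = lambda a: -0.5 * alpha * ((a - mu) / sigma) ** 2
# A critic whose maximum sits one unit away from the policy mean violates it.
q_shifted = lambda a: -0.5 * alpha * ((a - (mu + 1.0)) / sigma) ** 2

actions = [-1.0, -0.5, 0.0, 0.5, 1.0]
penalty_matched = score_matching_penalty(log_pi, q_matched, actions, alpha)
penalty_shifted = score_matching_penalty(log_pi, q_shifted, actions, alpha)
print(penalty_matched, penalty_shifted)
```

In this toy setting the penalty vanishes (up to floating-point error) when the critic's action-gradient agrees with the policy score, and it grows when the critic prefers actions the policy does not. In an offline training loop, a term of this kind would presumably be added to the critic loss with some weight, but that weighting and the sampling of actions are details the abstract leaves open.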