Tsinghua AI | Jun 14, 2026 | arXiv:2606.15734

Retrievable Gradients: Continual Post-Training Without Cumulative Weight Drift

Weihang Su, Jiacheng Kang, Jingyan Xu, Qingyao Ai, Jianming Long, Hanwen Zhang, Bangde Du, Xinyuan Cao, Min Zhang, Yiqun Liu

AI Summary

This paper introduces ReGrad (Retrievable Gradients), an approach to continual post-training that avoids cumulative weight drift by treating gradients as retrievable units of knowledge. ReGrad pre-computes document-specific gradients offline, stores them in an indexed Gradient Bank, and retrieves only query-relevant gradients at inference time for temporary weight adaptation, rather than repeatedly updating shared parameters. Because raw language-modeling gradients target token-level document reconstruction, a bi-level meta-learning objective reshapes them into adaptation signals that generalize to downstream tasks. The authors report that ReGrad outperforms CPT and RAG baselines in both general and domain-specific settings.

Key Contribution

ReGrad enables scalable and reversible parametric knowledge injection without accumulating weight drift, and it outperforms CPT and RAG baselines in both general and domain-specific settings.

Abstract

Continual post-training enables models to absorb emerging knowledge after deployment, but repeatedly updating shared parameters can accumulate weight drift, potentially causing catastrophic forgetting and degrading general capabilities. Retrieval-augmented generation avoids such parameter drift, yet often lacks the depth of parametric knowledge integration. In this paper, we propose ReGrad (Retrievable Gradients), a new paradigm that treats gradients as retrievable units of knowledge. ReGrad pre-computes document-specific gradients offline, stores them in an indexed Gradient Bank, and retrieves only query-relevant gradients at inference time for temporary weight adaptation. However, raw language-modeling gradients are optimized for token-level document reconstruction rather than for query-driven knowledge use. We therefore introduce a bi-level meta-learning objective that reshapes document-derived gradients into generalizable adaptation signals for downstream tasks. Experiments across general and domain-specific settings show that ReGrad outperforms CPT and RAG baselines, enabling scalable and reversible parametric knowledge injection without accumulating weight drift.
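
The store-then-retrieve flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a two-weight linear model with a squared-error loss in place of a language model, uses cosine similarity over raw feature vectors as the retrieval index, and omits the bi-level meta-learning objective entirely. The names `GradientBank`, `doc_gradient`, and `adapted_weights`, and the step size `lr`, are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def doc_gradient(weights, features, target):
    """Gradient of 0.5 * (w.x - target)^2 w.r.t. w (toy stand-in for an LM loss)."""
    err = dot(weights, features) - target
    return [err * x for x in features]

class GradientBank:
    """Hypothetical index: document key vector -> pre-computed gradient."""
    def __init__(self):
        self.entries = []  # list of (key_vector, gradient)

    def add(self, key, grad):
        self.entries.append((list(key), list(grad)))

    def retrieve(self, query, k=1):
        # Rank stored gradients by similarity between the query and each key.
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [grad for _, grad in ranked[:k]]

def adapted_weights(base, grads, lr=0.5):
    """Return a temporary copy of the weights; the base list is never mutated."""
    out = list(base)
    for g in grads:
        out = [w - lr * gi for w, gi in zip(out, g)]
    return out

# Offline: compute one gradient per document at the frozen base weights.
base = [0.0, 0.0]
docs = [([1.0, 0.0], 2.0), ([0.0, 1.0], -4.0)]
bank = GradientBank()
for features, target in docs:
    bank.add(features, doc_gradient(base, features, target))

# Inference: retrieve the most relevant gradient and adapt a temporary copy.
query = [0.9, 0.1]
temp = adapted_weights(base, bank.retrieve(query, k=1))
prediction = dot(temp, query)
```

In this sketch the base weights are left untouched after each query, which is the sense in which the adaptation is temporary and reversible; discarding `temp` restores the original model exactly.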

Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
