This paper introduces ReGrad (Retrievable Gradients), an approach to continual post-training that avoids accumulating weight drift by treating gradients as retrievable units of knowledge. ReGrad pre-computes document-specific gradients offline, stores them in an indexed Gradient Bank, and retrieves only query-relevant gradients at inference time to adapt the weights temporarily, leaving the shared parameters unchanged. Experiments in general and domain-specific settings show that ReGrad outperforms continual post-training (CPT) and retrieval-augmented generation (RAG) baselines, offering a scalable and reversible way to integrate new knowledge.
ReGrad enables scalable and reversible parametric knowledge injection without accumulating weight drift, outperforming CPT and RAG baselines in both general and domain-specific settings.
Continual post-training enables models to absorb emerging knowledge after deployment, but repeatedly updating shared parameters can accumulate weight drift, potentially causing catastrophic forgetting and degrading general capabilities. Retrieval-augmented generation avoids such parameter drift, yet often lacks the depth of parametric knowledge integration. In this paper, we propose ReGrad (Retrievable Gradients), a new paradigm that treats gradients as retrievable units of knowledge. ReGrad pre-computes document-specific gradients offline, stores them in an indexed Gradient Bank, and retrieves only query-relevant gradients at inference time for temporary weight adaptation. However, raw language-modeling gradients are optimized for token-level document reconstruction rather than for query-driven knowledge use. We therefore introduce a bi-level meta-learning objective that reshapes document-derived gradients into generalizable adaptation signals for downstream tasks. Experiments across general and domain-specific settings show that ReGrad outperforms CPT and RAG baselines, enabling scalable and reversible parametric knowledge injection without accumulating weight drift.
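The store-retrieve-adapt loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear model, the squared-error loss, the cosine-similarity index, and every name here (`GradientBank`, `doc_gradient`, `adapt`, the learning rate) are assumptions made for illustration, and the bi-level meta-learning objective is omitted entirely.

```python
import math
from dataclasses import dataclass, field


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def doc_gradient(weights, features, target):
    # Assumed toy loss: 0.5 * (w . x - y)^2, whose gradient w.r.t. w is (w . x - y) * x.
    err = dot(weights, features) - target
    return [err * x for x in features]


@dataclass
class GradientBank:
    """Hypothetical index mapping a document key vector to its stored gradient."""
    keys: list = field(default_factory=list)
    grads: list = field(default_factory=list)

    def add(self, key, grad):
        self.keys.append(key)
        self.grads.append(grad)

    def retrieve(self, query, k=1):
        # Rank stored documents by cosine similarity to the query key.
        ranked = sorted(range(len(self.keys)),
                        key=lambda i: cosine(query, self.keys[i]),
                        reverse=True)
        return [self.grads[i] for i in ranked[:k]]


def adapt(weights, grads, lr=0.5):
    # Returns a temporary copy; the base weights are never mutated,
    # so discarding the copy reverts the adaptation.
    adapted = list(weights)
    for g in grads:
        adapted = [w - lr * gi for w, gi in zip(adapted, g)]
    return adapted


# Offline: pre-compute one gradient per document against the base weights.
base = [0.0, 0.0]
bank = GradientBank()
bank.add([1.0, 0.0], doc_gradient(base, [1.0, 0.0], 2.0))
bank.add([0.0, 1.0], doc_gradient(base, [0.0, 1.0], -4.0))

# Inference: retrieve only the query-relevant gradient and adapt temporarily.
query = [0.9, 0.1]
adapted = adapt(base, bank.retrieve(query, k=1))
```

In this sketch the query is closest to the first document, so only that document's gradient is applied, and `base` is left untouched after the call. That is the sense in which the adaptation is reversible and does not accumulate drift across queries.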