NVIDIA | Meituan | PKU | Mar 3, 2026 | arXiv:2603.02908

SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training

Qi Zhang, Xiaohan Wang, Jiajun Chai, Guojun Yin, Yisen Wang

AI Summary

The paper introduces the SAE-based Transferability Score (STS), a metric that uses sparse autoencoders (SAEs) to predict the transferability of large language models (LLMs) after supervised fine-tuning. STS identifies shifted dimensions in SAE representations and correlates them with downstream domain performance, enabling transferability estimation without fine-tuning. Experiments across multiple models and domains demonstrate that STS accurately predicts transferability, achieving Pearson correlation coefficients above 0.7 with actual performance changes, and shows initial promise for reinforcement learning.
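The "Pearson correlation above 0.7" claim refers to how closely a predicted score tracks the measured performance change across domains. As a minimal sketch of that check, the snippet below computes the Pearson coefficient by hand. The function name `pearson` and both input lists are hypothetical; the numbers are placeholders chosen for illustration and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalised because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only; NOT taken from the paper.
predicted_scores = [0.10, 0.35, 0.20, 0.60, 0.45]   # one score per target domain
observed_changes = [-0.5, 1.2, 0.3, 2.8, 1.9]       # accuracy change after fine-tuning

r = pearson(predicted_scores, observed_changes)
print(f"Pearson r = {r:.3f}")
```

A value of `r` near 1 would mean the score ranks and scales domains much like the observed changes do; the paper reports coefficients above 0.7 for its own metric.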

Key Contribution

Predict how well your LLM will transfer to a new domain *before* fine-tuning, by using sparse autoencoders to spot tell-tale signs of domain shift in the model's representations.

Abstract

In recent years, pre-trained large language models have achieved remarkable success across diverse tasks. Besides the pivotal role of self-supervised pre-training, their effectiveness in downstream applications also depends critically on the post-training process, which adapts models to task-specific data and objectives. However, this process inevitably introduces model shifts that can influence performance in different domains, and how such shifts transfer remains poorly understood. To open up the black box, we propose the SAE-based Transferability Score (STS), a new metric that leverages sparse autoencoders (SAEs) to forecast post-training transferability. Taking supervised fine-tuning as an example, STS identifies shifted dimensions in SAE representations and calculates their correlations with downstream domains, enabling reliable estimation of transferability *before* fine-tuning. Extensive experiments across multiple models and domains show that STS accurately predicts the transferability of supervised fine-tuning, achieving Pearson correlation coefficients above 0.7 with actual performance changes. Beyond this, we take an initial step toward extending STS to reinforcement learning. We believe that STS can serve as an interpretable tool for guiding post-training strategies in LLMs. Code is available at https://github.com/PKU-ML/STS.
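The abstract describes two steps: identify the SAE dimensions that shift, then relate them to each downstream domain. The sketch below illustrates that general shape only. It is not the STS formula, which the abstract does not spell out; the overlap measure, the top-k selection, and the names `top_k_dims` and `overlap_score` are all assumptions made for illustration, and the per-dimension shift estimates are assumed to be given. The repository linked above is the authority on the actual method.

```python
def top_k_dims(shift, k):
    """Indices of the k SAE dimensions with the largest absolute shift.

    `shift` is a hypothetical per-dimension shift estimate; how such an
    estimate is obtained is not specified here.
    """
    order = sorted(range(len(shift)), key=lambda i: abs(shift[i]), reverse=True)
    return sorted(order[:k])


def overlap_score(shift, domain_activation, k):
    """Share of a domain's SAE activation mass that lies on the top-k shifted dims.

    This is an illustrative proxy (an assumption), not the paper's metric.
    Returns a value in [0, 1]; 0.0 if the domain has no activation at all.
    """
    dims = top_k_dims(shift, k)
    total = sum(abs(a) for a in domain_activation)
    if total == 0:
        return 0.0
    return sum(abs(domain_activation[i]) for i in dims) / total


# Toy inputs for illustration only: 4 SAE dimensions, 2 downstream domains.
shift = [0.0, 0.9, 0.1, -0.7]        # dimensions 1 and 3 shift the most
domain_a = [0.0, 2.0, 0.0, 2.0]      # activates exactly the shifted dimensions
domain_b = [3.0, 0.0, 1.0, 0.0]      # activates only the unshifted dimensions

score_a = overlap_score(shift, domain_a, k=2)
score_b = overlap_score(shift, domain_b, k=2)
print(score_a, score_b)
```

Under this toy proxy, a domain whose active features coincide with the shifted dimensions scores high and would be expected to be more affected by fine-tuning, while a domain that relies on untouched features scores low.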

Eval Frameworks & Benchmarks | Interpretability & Mechanistic Interp | Natural Language Processing

