Stanford HAI | Feb 24, 2026 | arXiv:2602.20904

Transcoder Adapters for Reasoning-Model Diffing

Nathan Hu, Jake Ward, Thomas Icard, Christopher Potts

AI Summary

The paper introduces transcoder adapters, a method for approximating and interpreting the difference in MLP computation in a language model before and after reasoning fine-tuning. The authors apply the technique to compare Qwen2.5-Math-7B with its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B, and report that the learned adapters are faithful to the target model's internal computation and next-token predictions. They trace the production of hesitation tokens (e.g., "wait") to a small subset of adapter features (~2.4%) and show that this subset is both necessary and sufficient for the behavior, which offers insight into the mechanisms of reasoning training.

Key Contribution

Shows that a small fraction of adapter features (~2.4%) accounts for a specific reasoning behavior, the production of hesitation tokens. Removing these features reduces response length, often without affecting accuracy, which suggests a possible route to targeted model editing.

Abstract

While reasoning models are increasingly ubiquitous, the effects of reasoning training on a model's internal mechanisms remain poorly understood. In this work, we introduce transcoder adapters, a technique for learning an interpretable approximation of the difference in MLP computation before and after fine-tuning. We apply transcoder adapters to characterize the differences between Qwen2.5-Math-7B and its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B. Learned adapters are faithful to the target model's internal computation and next-token predictions. When evaluated on reasoning benchmarks, adapters match the reasoning model's response lengths and typically recover 50-90% of the accuracy gains from reasoning fine-tuning. Adapter features are sparsely activating and interpretable. When examining adapter features, we find that only ~8% have activating examples directly related to reasoning behaviors. We deeply study one such behavior -- the production of hesitation tokens (e.g., "wait"). Using attribution graphs, we trace hesitation to only ~2.4% of adapter features (5.6k total) performing one of two functions. These features are necessary and sufficient for producing hesitation tokens; removing them reduces response length, often without affecting accuracy. Overall, our results provide insight into reasoning training and suggest transcoder adapters may be useful for studying fine-tuning more broadly.
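The abstract does not spell out the adapter architecture, so the following is only a minimal sketch under stated assumptions: the adapter is taken to be a ReLU encoder/decoder whose output is added to the frozen base model's MLP output, trained so that the sum approximates the fine-tuned model's MLP output. The names `TranscoderAdapter`, `base_mlp`, `ablate`, and `mse` are hypothetical, the weights are random stand-ins, and the squared-error objective is an assumption rather than the paper's stated loss. The `ablate` argument illustrates the kind of feature-removal experiment the abstract describes.

```python
import random

def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Random Gaussian weights; a stand-in for learned parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class TranscoderAdapter:
    """Hypothetical adapter: sparse features added on top of a frozen base MLP."""

    def __init__(self, d_model, n_features, rng):
        self.W_enc = rand_matrix(n_features, d_model, rng)  # input -> features
        self.b_enc = [0.0] * n_features
        self.W_dec = rand_matrix(d_model, n_features, rng)  # features -> output

    def features(self, x):
        """ReLU feature activations (assumed nonlinearity)."""
        pre = matvec(self.W_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def forward(self, x, base_mlp, ablate=()):
        """Base MLP output plus the adapter's contribution.

        Feature indices listed in `ablate` are zeroed before decoding.
        """
        acts = self.features(x)
        for i in ablate:
            acts[i] = 0.0
        delta = matvec(self.W_dec, acts)
        return [b + d for b, d in zip(base_mlp(x), delta)]

def mse(pred, target):
    """Assumed training objective: match the fine-tuned model's MLP output."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def base_mlp(x):
    """Toy stand-in for the frozen base model's MLP."""
    return [2.0 * xi for xi in x]

rng = random.Random(0)
adapter = TranscoderAdapter(d_model=4, n_features=8, rng=rng)
x = [1.0, -0.5, 0.25, 0.0]

full = adapter.forward(x, base_mlp)
# Ablating every feature removes the adapter's contribution entirely,
# so the output falls back to the base MLP.
ablated = adapter.forward(x, base_mlp, ablate=range(8))
```

In this toy setup, ablating a subset of feature indices and comparing the result against `full` is the analogue of removing the hesitation-related features and measuring the change in behavior.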

Interpretability & Mechanistic Interp | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
