UMD | Jun 11, 2026 | arXiv:2606.12818

Localizing Anchoring Pathways in Language Models

Hillary N. Owusu, Sarah Wiegreffe, Naomi H. Feldman

AI Summary

This study investigates how irrelevant numbers in a prompt anchor language model judgments, using a controlled multiple-choice setup with shared answer options. The authors define a logit-difference metric that tracks behavioral anchoring and show that edge-level attribution methods recover the anchoring signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, but sparse transfer between base and instruction-tuned variants is less reliable, suggesting that post-training changes which pathways matter most.

Key Contribution

Edge-level methods uncover how irrelevant numerical anchors influence language model judgments, revealing shared pathways that shift with model tuning.

Abstract

Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B-8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.
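The logit-difference metric described in the abstract can be illustrated with a minimal sketch. The abstract only says the metric compares the correct answer option with the option corresponding to the anchor, so the sign convention (correct minus anchor), the function name `anchor_logit_diff`, the option labels, and the logit values below are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Mapping


def anchor_logit_diff(
    option_logits: Mapping[str, float], correct: str, anchor: str
) -> float:
    """Return logit(correct option) minus logit(anchor-matching option).

    Assumed sign convention: positive values mean the model prefers the
    correct option; negative values mean it prefers the anchor option.
    """
    if correct == anchor:
        # The comparison is only meaningful when the two options differ.
        raise ValueError("correct and anchor options must differ")
    return option_logits[correct] - option_logits[anchor]


# Hypothetical next-token logits over four shared answer options.
logits = {"A": 2.0, "B": 0.5, "C": -1.0, "D": 0.0}

# Suppose "A" is correct and "B" is the option matching the anchor.
diff = anchor_logit_diff(logits, correct="A", anchor="B")
print(diff)  # 1.5
```

Under this assumed convention, a smaller or negative value on an anchored prompt would correspond to the kind of shift toward the anchor that the abstract calls behavioral anchoring.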

Interpretability & Mechanistic Interp | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 28

Year: 2026

Venue: N/A
