This paper introduces ADAG, an automated pipeline for describing attribution graphs in language models, which removes the reliance on manual interpretation in circuit tracing. ADAG uses "attribution profiles", built from each feature's input and output gradient effects, to quantify feature roles. It then clusters features and uses an LLM to generate and score natural-language explanations of each cluster. On circuit-tracing tasks previously studied by hand, the system recovers interpretable circuits, and it identifies steerable clusters responsible for a harmful-advice jailbreak in Llama 3.1 8B Instruct.
Automating the description step of circuit tracing yields interpretable circuits without manual feature inspection, and pinpoints the feature clusters behind a harmful-advice jailbreak in Llama 3.1 8B Instruct.
In language model interpretability research, circuit tracing aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some behaviour. However, all prior circuit tracing work has relied on ad-hoc human interpretation of the role that each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the component activates on. We introduce ADAG, a fully automated, end-to-end pipeline for describing the resulting attribution graphs. To achieve this, we introduce attribution profiles, which quantify the functional role of a feature via its input and output gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer-simulator setup which generates and scores natural-language explanations of the functional role of these feature groups. We run our system on known human-analysed circuit-tracing tasks and recover interpretable circuits, and further show that ADAG can find steerable clusters which are responsible for a harmful-advice jailbreak in Llama 3.1 8B Instruct.
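To make the idea of an attribution profile more concrete, here is a minimal sketch under stated assumptions. It is not the paper's implementation: the toy graph, the node names, the edge weights, and the greedy cosine-similarity grouping are all invented for illustration, and the grouping step is a simple stand-in for the paper's own clustering algorithm, which the abstract does not specify. The sketch treats a feature's profile as the concatenation of its direct incoming attributions from input tokens and its direct outgoing attributions to output logits.

```python
import math

# Toy attribution graph: (source, target) -> attribution weight.
# All node names and weights are hypothetical, chosen only for illustration.
EDGES = {
    ("tok:Paris", "feat:A"): 0.9,
    ("tok:Paris", "feat:B"): 0.8,
    ("tok:capital", "feat:C"): 0.7,
    ("feat:A", "logit:France"): 0.6,
    ("feat:B", "logit:France"): 0.5,
    ("feat:C", "logit:city"): 0.4,
}
INPUTS = ["tok:Paris", "tok:capital"]
OUTPUTS = ["logit:France", "logit:city"]
FEATURES = ["feat:A", "feat:B", "feat:C"]


def attribution_profile(feature):
    """Concatenate input-side and output-side attributions, L2-normalised."""
    vec = [EDGES.get((src, feature), 0.0) for src in INPUTS]
    vec += [EDGES.get((feature, dst), 0.0) for dst in OUTPUTS]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(u, v):
    # Profiles are unit-length (or all-zero), so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))


def greedy_cluster(features, threshold=0.9):
    """Put each feature in the first cluster whose seed profile is similar enough.

    A deliberately simple stand-in, not the clustering algorithm from the paper.
    """
    clusters = []  # list of (seed_profile, member_names)
    for f in features:
        p = attribution_profile(f)
        for seed, members in clusters:
            if cosine(seed, p) >= threshold:
                members.append(f)
                break
        else:
            clusters.append((p, [f]))
    return [members for _, members in clusters]


clusters = greedy_cluster(FEATURES)
print(clusters)
```

In this toy graph, `feat:A` and `feat:B` read from the same input token and write to the same output logit, so their profiles are nearly parallel and they are grouped together, while `feat:C` has a disjoint profile and forms its own group. In a pipeline of the kind the abstract describes, each such group would then be passed to an LLM to be explained and scored.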