36 papers published by 1 lab.
Chain-of-thought reasoning is often a lie: models systematically suppress any mention of the real reasons behind their answers, even when those reasons demonstrably influence the output.
Skip annotating image rationales: this method transfers text-based rationales to images for explainable crisis classification, saving annotation effort while boosting performance.
Control LLMs without retraining: pinpointing just a few key neurons lets you steer outputs more reliably than attribution methods.
Fine-tuning LVLMs on counting alone boosts general visual reasoning by over 1.5%, revealing counting as a surprisingly central skill.
VLAs aren't just memorizing training data; sparse autoencoders reveal a hidden layer of generalizable motion primitives that can be steered to control robot behavior across tasks.
Forget comparing models with benchmarks – mapping them by prompt-response likelihoods reveals hidden relationships between architecture, training data, and even how prompts compose.
VLMs selectively ignore visual information based on question framing, even when the visual reasoning task remains identical, highlighting a critical vulnerability in their grounding capabilities.
Get faithful and plausible natural language explanations for chest X-rays with as few as 5 human-annotated examples per diagnosis, and even boost classification accuracy in the process.
Unstable explanations plague ML models on spectroscopy data, but SHAPCA offers a more consistent and interpretable approach by combining PCA and SHAP values in the original input space.
LLM explanation faithfulness varies wildly depending on how you test it, and might even be *anti*-faithful, so stop relying on single-intervention benchmarks.
Unlock the power of interpretable AI: SINDy-KANs distills complex neural networks into sparse equations, revealing the underlying dynamics of the systems they model.
LLMs can introspect on their own internal emotive states during conversations with surprising accuracy, opening a new avenue for monitoring and influencing their behavior.
Turns out, VLA models are mostly just looking at the scene: visual pathways dominate action generation, and language only matters when the visuals are ambiguous.
You *can* have it all: high-performance anomaly detection, interpretability, and fairness, even in highly imbalanced industrial datasets.
Uncover hidden relationships in drug discovery: BVSIMC uses Bayesian variable selection to pinpoint the most relevant chemical and genomic features, boosting prediction accuracy and interpretability.
You can get state-of-the-art performance on retinal fundus image tasks with an interpretable foundation model that's 16x smaller than the alternatives.
Ditch slow, unstable AR estimation: neural nets offer a 12x speed boost and better convergence, without sacrificing interpretability.
Forget static embeddings: this paper shows how modeling scientific concepts as evolving complex networks reveals surprising connections between conceptual change and network topology.
Locomotion policies, often considered black boxes, can autonomously learn interpretable phase structures and branching logic, revealing a hidden order in their decision-making.
Video diffusion transformers exhibit a hidden "magnitude hierarchy" in their activations that can be exploited for training-free quality improvements via a simple steering method.
LLMs don't just regurgitate token probabilities when expressing confidence; they perform a more sophisticated, cached self-evaluation of answer quality.
LLMs encode hierarchical semantic relations asymmetrically, with hypernymy being far more robust and redundantly represented than hyponymy.
Attention sinks aren't just a forward-pass phenomenon; they actively warp the training landscape by creating "gradient sinks" that drive massive activations.
People prefer XAI explanations that tell them *why* a feature change doesn't alter the outcome, not just *that* it doesn't.
MLLMs' image segmentation prowess isn't a given: a critical adapter layer actually *hurts* performance, with the LLM having to recover via attention-mediated refinement.
Anomaly detection gets a dose of interpretability: SYRAN learns human-readable equations that flag anomalies by violating learned invariants.
Pinpointing the training data behind an LLM's behavior is now possible without retraining, opening the door to precise debugging and targeted interventions.
Acoustic and phonetic NACs encode accent in fundamentally different ways, with implications for how we interpret and manipulate these representations.
Control the emotional tone of generated speech without any training by directly manipulating specific neurons within large audio-language models.
Image editing models leak fascinating hints about their world knowledge through "edit spillover"—unintended changes to semantically related regions—and this paper turns that leakage into a probe.
CLIP struggles with fine-grained details in cross-domain few-shot learning, but a cycle-consistency method can fix its vision-language alignment and boost performance.
You can now audit multi-agent LLM systems and trace responsibility for harmful outputs even without access to internal execution logs, thanks to a clever "self-describing text" technique.
An AI model can estimate legal age from clavicle CT scans with higher accuracy than human experts, potentially revolutionizing forensic age assessment.
Unlock explainable outlier detection in foundation models with FoMo-X, a modular framework that adds negligible inference overhead while revealing interpretable risk tiers and calibrated confidence measures.
Standard PCA, despite its widespread use in CAD, struggles to directly reveal the original design parameters of a geometry, but this paper identifies conditions for accurate parameter estimation.
LLMs aren't monolithic black boxes: they contain spatially organized, functionally specialized modules that can be automatically discovered.