April 24 – May 1, 2026

Interpretability & Mechanistic Interp - Weekly Roundup

33 papers published across 2 labs.

Top Papers

Apr 30, 2026

3w ago · also Assistance Publique Hôpitaux de Paris, Department of Medical Informatics, Georges Pompidou European Hospital, INRIA +2

Differentiable latent structure discovery for interpretable forecasting in clinical time series

Unlocking interpretable clinical forecasting: StructGP recovers causal relationships and patient progression patterns directly from irregular EHR data, outperforming black-box methods in accuracy and uncertainty calibration.

Ivan Lerner, Jean Feydy +4

Interpretability & Mechanistic Interp Natural Language Processing Scientific Discovery & Drug Design

3w ago

The TEA Nets framework combines AI and cognitive network science to model targets, events and actors in text

TEA Nets reveal that LLMs express sadness with lower emotional intensity than humans in psychotherapy contexts, highlighting potential limitations in their ability to simulate genuine emotional responses.

Sebastiano Franchini, Alexis Carrillo +6

Interpretability & Mechanistic Interp Natural Language Processing Tool Use & Agents

Hongliang Liu +2 · 3w ago

Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

Forget complex training schemes – pinpointing and tweaking just 20 neurons can flip an LLM from sycophantic to truthful, thanks to a new "perturbation probing" technique.

Hongliang Liu, Tung-Ling Li, Yuhao Wu

Interpretability & Mechanistic Interp RLHF & Preference Learning

3w ago

Learning to Reason: Targeted Knowledge Discovery and Fuzzy Logic Update for Robust Image Recognition

Unsupervised knowledge injection via fuzzy logic lets image classifiers reason about concepts they were never explicitly trained on, boosting accuracy and generalization.

Gurucharan Srinivas, J. Niemeijer +3

Computer Vision Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought

Tsinghua AI · 3w ago

DPN-LE: Dual Personality Neuron Localization and Editing for Large Language Models

LLMs can have their personalities surgically altered by tweaking just 0.5% of their neurons, preserving general capabilities while achieving competitive control.

Lifan Zheng, Xue Yang, Jiawei Chen +5

Interpretability & Mechanistic Interp Natural Language Processing

All Papers (33)

Apr 30, 2026

3w ago · also Assistance Publique Hôpitaux de Paris, Department of Medical Informatics, Georges Pompidou European Hospital, INRIA +2

Differentiable latent structure discovery for interpretable forecasting in clinical time series

Ivan Lerner, Jean Feydy +4

Interpretability & Mechanistic Interp Natural Language Processing Scientific Discovery & Drug Design

3w ago

The TEA Nets framework combines AI and cognitive network science to model targets, events and actors in text

Sebastiano Franchini, Alexis Carrillo +6

Interpretability & Mechanistic Interp Natural Language Processing Tool Use & Agents

Hongliang Liu +2 · 3w ago

Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

Forget complex training schemes – pinpointing and tweaking just 20 neurons can flip an LLM from sycophantic to truthful, thanks to a new "perturbation probing" technique.

Hongliang Liu, Tung-Ling Li, Yuhao Wu

Interpretability & Mechanistic Interp RLHF & Preference Learning

3w ago

Learning to Reason: Targeted Knowledge Discovery and Fuzzy Logic Update for Robust Image Recognition

Unsupervised knowledge injection via fuzzy logic lets image classifiers reason about concepts they were never explicitly trained on, boosting accuracy and generalization.

Gurucharan Srinivas, J. Niemeijer +3

Computer Vision Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought

Tsinghua AI · 3w ago

DPN-LE: Dual Personality Neuron Localization and Editing for Large Language Models

LLMs can have their personalities surgically altered by tweaking just 0.5% of their neurons, preserving general capabilities while achieving competitive control.

Lifan Zheng, Xue Yang, Jiawei Chen +5

Interpretability & Mechanistic Interp Natural Language Processing

NTT Human Informatics Laboratories · 3w ago

Debiasing Reward Models via Causally Motivated Inference-Time Intervention

Forget scaling laws: surgically debiasing reward models by intervening on just 2% of neurons lets smaller models punch *way* above their weight in alignment.

Kazutoshi Shinoda, Kosuke Nishida, Kyosuke Nishida

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp RLHF & Preference Learning

Kaixiang Shu · 3w ago

Adjoint Inversion Reveals Holographic Superposition and Destructive Interference in CNN Classifiers

CNN classifiers don't just select from cleaned features, they actively cancel out shared background information via destructive interference, rewriting our understanding of how these networks actually "see".

Kaixiang Shu

Architecture Design (Transformers, SSMs, MoE) Computer Vision Interpretability & Mechanistic Interp

3w ago

Explainable Load Forecasting with Covariate-Informed Time Series Foundation Models

TSFMs can achieve competitive forecasting performance in critical infrastructure applications while also providing interpretable explanations that align with established domain knowledge.

Matthias Hertel, Alexandra Nikoltchovska +9

Interpretability & Mechanistic Interp

3w ago · also Stanford HAI, Northeastern, UCL

Do Sparse Autoencoders Capture Concept Manifolds?

Sparse autoencoders, despite their popularity for extracting interpretable features, often fail to capture the underlying manifold structure of concepts, instead fragmenting them across multiple, diluted features.

Usha Bhalla, Thomas Fel +21

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp

Sigma Jahan +6 · 3w ago

DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures

Pinpointing the root cause of transformer failures just got a whole lot easier: DEFault++ can detect, categorize, and diagnose faults with high accuracy, even down to specific mechanisms.

Sigma Jahan, Saurabhsingh Rajput +4

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp

Sascha Xu +2 · 3w ago · also Helmholtz

Differential Subgroup Discovery: Characterizing Where Two Populations Differ, and Why

Uncover hidden drivers of disparity: pinpoint the specific combinations of characteristics that explain outcome gaps between populations.

Sascha Xu, Jilles Vreeken

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp Natural Language Processing

Prashant Kulkarni +1 · 3w ago

Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

LLMs betray prompt injection attacks with a tell-tale "restlessness" in their activation trajectories, detectable even when individual turns appear harmless.

Prashant Kulkarni

Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

Matteo Da Pelo +6 · 3w ago · also University of Cagliari, University of Salerno

Taming the Centaur(s) with LAPITHS: a framework for a theoretically grounded interpretation of AI performances

Claims of human-like cognition in models like CENTAUR crumble under LAPITHS, a framework that reveals these models' performance can be replicated by systems lacking cognitive plausibility.

Matteo Da Pelo, Alessio Donvito, Claudio Frongia +4

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Natural Language Processing

Apr 29, 2026

3w ago · also North South University, QMUL

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

LLMs stubbornly stick to task-appropriate reasoning even when explicitly instructed to use conflicting logic, but targeted interventions can nudge them towards better instruction following.

Xingwei Tan, Marco Valentino, Mahmud Elahi Akhter +3

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp +1

Howard University · 3w ago · also Adobe Research

FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing

Texture, not color, is the secret sauce behind fashion house identity, revealed by probing a multimodal CNN trained on decades of Vogue runway images.

Morayo Danielle Adeyemi, Ryan A. Rossi +2

Computer Vision Interpretability & Mechanistic Interp Multimodal Models

3w ago

Explaining the "Why": A Unified Framework for the Additive Attribution of Changes in Arbitrary Measures

Uncover the hidden drivers behind your KPIs: a new attribution framework finally explains *why* your metrics move, not just *what* changed.

Changsheng Zhou, Dajun Chen, Zhitao Shen +4

Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought

3w ago · also UMich

Semantic Structure of Feature Space in Large Language Models

LLMs aren't just memorizing words; they're organizing them in a feature space that mirrors the nuanced semantic relationships humans perceive.

Austin C. Kozlowski, Andrei Boutyline

Interpretability & Mechanistic Interp Natural Language Processing

3w ago · also Heriot-Watt University

MoRFI: Monotonic Sparse Autoencoder Feature Identification

LLMs' factual recall falters when fine-tuned on new information, and this can be traced to specific latent directions in the residual stream.

Dimitris Dimakopoulos, Shay B. Cohen, Ioannis Konstas

Interpretability & Mechanistic Interp

University of Missouri-Columbia · 3w ago

Formulating Subgroup Discovery as a Quantum Optimization Problem for Network Security

Quantum computing can surface critical network attack patterns that classical methods miss, achieving up to 99.6% test precision on unique subgroups.

Samuel Spell, Chi-Ren Shyu

Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

Apr 28, 2026

3w ago · also Barcelona Supercomputing Center (BSC), Departament de Física Quàntica i Astrofísica, ICREA, Institut de Ciències del Cosmos

Towards interpretable AI with quantum annealing feature selection

Quantum annealing offers a surprisingly effective route to interpretable AI, outperforming standard gradient-based methods in disentangling CNN decision boundaries.

Francesco Aldo Venturelli, Emanuele Costa, Sikha O K +3

Computer Vision Interpretability & Mechanistic Interp

IIT · 3w ago

Explainable AI for Jet Tagging: A Comparative Study of GNNExplainer, GNNShap, and GradCAM for Jet Tagging in the Lund Jet Plane

GNNs tagging jets at the LHC aren't black boxes: explainability methods reveal they learn physically meaningful features of QCD, with performance varying predictably across energy regimes.

Pahal D. Patel, Sanmay Ganguly

Interpretability & Mechanistic Interp Scientific Discovery & Drug Design

Josué Obregon · 3w ago

RCProb: Probabilistic Rule Extraction for Efficient Simplification of Tree Ensembles

Rule extraction from tree ensembles just got 22x faster, without sacrificing accuracy or interpretability.

Josué Obregon

Inference & Quantization Interpretability & Mechanistic Interp

Yongzhong Xu · 3w ago

Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories

Forget what you thought you knew about how models learn: analyzing loss gradients, not just parameter updates, reveals a hidden order of magnitude increase in the coupling between learned features and parameter space.

Yongzhong Xu

Interpretability & Mechanistic Interp Training Efficiency & Optimization

3w ago

From Syntax to Emotion: A Mechanistic Analysis of Emotion Inference in LLMs

LLMs process emotions in three distinct phases, but some emotions like Disgust are represented far more weakly and diffusely than others.

Bangzhao Shu, Arinjay Singh, Mai ElSherief

Interpretability & Mechanistic Interp Natural Language Processing

Ali Karkehabadi +3 · 3w ago

SaliencyDecor: Enhancing Neural Network Interpretability through Feature Decorrelation

Feature decorrelation during training not only sharpens saliency maps, but also *improves* model accuracy, challenging the conventional wisdom that interpretability comes at the cost of performance.

Ali Karkehabadi, Jamshid Hassanpour, H. Homayoun +1

Computer Vision Interpretability & Mechanistic Interp

Apr 27, 2026

Chandler Squires +3 · 3w ago

A Unifying Framework for Unsupervised Concept Extraction

Concept extraction's identifiability problem just got a lot easier, thanks to a new framework that turns guarantee proofs into set intersection problems.

Chandler Squires, Pradeep Ravikumar +1

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp

3w ago · also DFKI

Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models

RL's superior generalization isn't about brute force, but about carefully sculpting a few key features while preserving the base model's knowledge, unlike SFT's rapid specialization.

Dan Shi, S. Ostermann, Renren Jin +2

Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought RLHF & Preference Learning

Nay Myat Min +2 · 3w ago

Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models

A single, tuning-free "health signal" derived from layer activations can catch backdoors, jailbreaks, and prompt injections in LLMs, even without a clean reference model.

Nay Myat Min, Long H. Pham, Jun Sun

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

Brandon Hsu +7 · 3w ago

Contextual Linear Activation Steering of Language Models

Forget fixed steering strengths - CLAS dynamically adapts steering based on context, unlocking more consistent and powerful control over LLM behavior.

Brandon Hsu, Daniel Beaglehole +5

Interpretability & Mechanistic Interp Natural Language Processing

Zhuoling Li +3 · 3w ago

XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

GraphRAG's black-box reasoning gets a spotlight: XGRAG reveals how specific knowledge graph components influence LLM outputs, boosting explanation quality by 14.81% over standard RAG explainability methods.

Zhuoling Li, Ha Nguyen, Valeria Bladinieres +1

Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought Recommendation & Information Retrieval

3w ago · also Flatiron, NYU

Learning biophysical models of gene regulation with probability flow matching

Biophysically-constrained models of gene regulation, learned via probability flow matching, are the only ones that accurately predict cell fate decisions and responses to perturbations, even when other models interpolate the training data just as well.

Suryanarayana Maddu, Victor Chardès +3

Interpretability & Mechanistic Interp Scientific Discovery & Drug Design

Tanmoy Mukherjee +3 · 3w ago

Credal Concept Bottleneck Models for Epistemic-Aleatoric Uncertainty Decomposition

Concept bottleneck models can now distinguish between reducible model uncertainty and irreducible input ambiguity, enabling targeted interventions like data collection and human review.

Tanmoy Mukherjee, Thomas Bailleux, Pierre Marquis +1

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp

Apr 25, 2026

Chathurangi Shyalika +2 · 3w ago

IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance

Neurosymbolic grounding of LLMs in telemetry and knowledge graphs slashes expert-rated overclaims in industrial maintenance explanations by 93%, making AI assistants far more trustworthy in safety-critical settings.

Chathurangi Shyalika, Dhaval Patel, Amit P. Sheth

Interpretability & Mechanistic Interp Natural Language Processing Robotics & Embodied AI +1
