Search papers, labs, and topics across Lattice.
83 papers published across 5 labs.
Achieve human-readable interpretability in medical tabular data classification without sacrificing accuracy by learning and comparing against prototypical patient feature subsets.
Robots can boost their perceived competence by 83% simply by tweaking navigation behaviors suggested by a causal Bayesian network.
Achieving fairness doesn't just mean equal outcomes—this work shows how to enforce consistent reasoning across groups by penalizing disparities in counterfactual explanations.
Uncover hidden backdoors in your neural networks by tracing the active paths that malicious triggers exploit.
Forget subjective human evaluations: this paper uses a clever knowledge distillation trick to objectively rank XAI methods for NMT, revealing that attention-based attributions beat gradient-based ones.
Uncover the hidden causal chains inside your LLM with Causal Concept Graphs, which outperform existing methods for reasoning by explicitly modeling concept dependencies.
Speech deepfake detection gets a reasoning upgrade: HIR-SDD uses chain-of-thought prompting with Large Audio Language Models to not only detect fakes but also explain *why* it thinks they're fake.
Clinicians using HeartAgent, a cardiology-specific agent system, improved diagnostic accuracy by 26.9% and explanatory quality by 22.7% compared to unaided experts.
Forget fine-tuning: surprisingly, single neuron activations in VLMs can be directly probed to create classifiers that outperform the full model, with 5x speedups.
Chinese metaphor identification is highly sensitive to the choice of protocol, an effect that dwarfs model-level variations, yet it can be tackled with fully transparent, LLM-assisted rule scripts.
Prompt highlighting in LLMs gets a serious upgrade: PRISM-Δ steers models to focus on relevant text spans with better accuracy and fluency, even in long contexts.
Fair-Gate disentangles speaker identity and sex in voice biometrics, boosting fairness without sacrificing accuracy by explicitly routing features through identity- and sex-specific pathways.
LLMs possess a "word recovery" mechanism that allows them to reconstruct canonical word-level tokens from character-level inputs, explaining their surprising robustness to non-canonical tokenization.
LLM activation spaces aren't linear, and exploiting their true geometry with "Curveball steering" unlocks more effective control than standard linear interventions.
Forget interference as just noise: correlated features in neural networks can constructively superpose to form semantic clusters, especially with weight decay.
Backdoor defenses focused on removing training triggers are fundamentally flawed, as alternative, perceptually distinct triggers can reliably activate the same backdoor via a latent feature-space direction.
BrainSTR disentangles subtle disease signatures in dynamic brain networks by explicitly modeling spatio-temporal dependencies with contrastive learning, revealing interpretable biomarkers for neuropsychiatric disorders.
Forget return curves: a simple measure of neuron activation patterns (OUI) at just 10% of training can predict PPO performance better than existing methods, enabling early pruning of bad runs.
Forget black-box policies: CSRO uses LLMs to generate human-readable code policies in multi-agent RL, achieving performance competitive with traditional methods.
LLMs' attention patterns subtly shift with emotional tone, and explicitly accounting for these shifts during training improves reading comprehension even on neutral datasets.
Language models often disregard provided context, choosing instead to rely on potentially outdated or conflicting information learned during pre-training, revealing a critical flaw in their knowledge integration.
DNN neurons often fire *more* strongly when a concept is missing, revealing a blind spot in standard XAI methods that can now be addressed.
Mixture-of-Experts models might be hiding more of their reasoning than we thought, thanks to a newly quantified "opaque serial depth" metric.
LLM explanations are far more sensitive to the task being performed than the context or learned classes, highlighting a critical instability in current interpretability methods.
Attention heatmaps in MIL models for histopathology are often misleading, and simpler methods like perturbation or LRP provide more faithful explanations.
You can now audit black-box vision models for biases and failure modes using only their output probabilities, thanks to a clever LLM-powered semantic search.
Detect anomalies in complex systems with a novel explainable condition monitoring methodology that learns from healthy data alone, offering competitive performance and enhanced interpretability for safety-critical applications.
Forget noisy, biased LLM evaluators: CDRRM distills preference insights into compact rubrics, letting a frozen judge model leapfrog fully fine-tuned baselines with just 3k training samples.
Get zero-shot, explainable fault diagnoses from your industrial time series data by translating sensor signals into natural language that LLMs can understand.
LLaMA and Gemma may seem to understand complex conditional statements, but they're really just pattern-matching, not grasping the underlying pragmatic nuances of presuppositions.
LLMs can now safely navigate the complexities of acupuncture clinical decision support, thanks to a neuro-symbolic framework that slashes safety violations from 8.5% to zero.
Stop blindly trusting your fault detection models: this hybrid CNN-GRU approach uses explainable AI to reveal the reasoning behind its predictions, enabling adaptation and root cause analysis in automotive software validation.
Time series counterfactual explanations can now be more realistic thanks to a novel soft-DTW-based approach that preserves temporal structure.
By representing prototypes as orthonormal bases on the Stiefel manifold, this work makes prototype collapse infeasible by construction, leading to more interpretable and accurate image recognition.
Causal effects between high-dimensional variables may be simpler than you think: they often depend only on low-dimensional summary statistics, or bottlenecks, of the causes.
Uncover hidden vulnerabilities in Transformer models with SYNAPSE, a training-free framework that reveals how small manipulations can redirect predictions even though task-relevant information is redundantly encoded across broad neuron subsets.
By explicitly modeling joint mechanics with language-aligned tokens, BioGait-VLM prevents gait analysis models from overfitting to visual shortcuts and unlocks improved generalization and interpretability.
LLMs represent meaning more abstractly than previously thought: changing the script of a sentence (Latin vs. Cyrillic) causes less representational divergence than paraphrasing it within the same script.
Achieve more accurate and interpretable mortality risk predictions in ICUs by explicitly modeling irregular temporal dynamics and integrating standardized medical knowledge into time-aware RNNs.
Code obfuscation doesn't always make things harder for humans: certain renaming techniques in Python can actually *improve* program comprehension compared to the original code.
Diffusion language models have surprisingly redundant early layers, enabling nearly 20% FLOPs reduction at inference time via layer skipping without sacrificing performance.
Whitening the embedding space of GPT-2-small exposes cluster commitment as the key geometric property separating different types of language model hallucinations.
Achieve transformer interpretability by disentangling token and context processing streams, with only a 2.5% performance hit using Kronecker mixing.
Forget retraining: Steering a handful of attention heads in audio-language models can boost audio understanding by 8%, revealing a surprisingly simple way to overcome text dominance.
LLM feed-forward networks have hidden spectral signatures that predict generalization and respond predictably to design choices, opening the door to more principled architecture and optimizer selection.
BioLLMAgent bridges the gap between interpretable but unrealistic RL models and realistic but opaque LLM agents, offering a "computational sandbox" for testing psychiatric hypotheses.
A "credibility warning system" for AI-driven business decisions is now possible, thanks to a new metric that reveals how much explanations wobble when the data shifts.
AI models can detect injected thoughts, but they often have no idea *what* those thoughts are, relying on content-agnostic anomaly detection and then guessing common concepts.
LLMs often know the answer long before their "reasoning" suggests, wasting tokens on performative chain-of-thought.
Algorithmic decisions about humans can now be audited for "Representation Fidelity" by checking if they align with self-reported descriptions, revealing potential biases and inaccuracies.
Transformers perform analogical reasoning by aligning feature representations of similar entities, but only if trained with the right curriculum.
The common belief that a two-step decision workflow reduces overreliance on AI advice doesn't hold up, and the effectiveness of explanations hinges on the specific workflow and user expertise.
Forget retraining: you can steer a robot's behavior in real-time by nudging its internal representations with lightweight, targeted interventions.
AI models are more like patients than black boxes: "Model Medicine" offers a clinical framework and open-source tools to diagnose and treat their "ailments."
Forget retraining or complex architectures: a simple linear head can effectively eliminate missingness bias in feature attribution, rivaling heavyweight methods.
By constraining Transformer architectures to have bounded representations and uniform attention, grokking can be bypassed entirely for modular addition, suggesting task-specific geometric alignment is key.
Achieve more robust and informative visual explanations for CNNs by adaptively fusing gradient-based and region-based CAM methods, outperforming existing approaches on standard benchmarks.
Forget probing transformer block outputs: the *real* OOD performance gains in ViTs come from selectively probing feedforward network activations or self-attention outputs depending on the severity of the distribution shift.
Hallucinations in VLMs can be predicted *before* any text is generated, opening the door to early intervention and more efficient, safer models.
Escape the curse of off-manifold Shapley values: this new method leverages optimal generative flows to produce attributions that actually respect the data manifold.
Prototype-based deep learning offers a more trustworthy approach to prostate cancer grading by mirroring a pathologist's workflow of comparing suspicious regions with clinically validated examples.
Achieve expert-level hepatology diagnosis by mimicking multidisciplinary consultation, using an AI system that combines knowledge graph reasoning, clinical guidelines, and a multi-agent system for traceable consensus.
Finally, a PAR framework that doesn't just classify patient activities, but tells you *why* a set of visual cues implies a risk, complete with auditable rule traces and counterfactual interventions.
Asymmetric Shapley values offer a more robust and interpretable approach to feature importance in clinical prediction by accounting for collinearity and known directional dependencies, overcoming limitations of traditional methods.
Finally, a unified framework illuminates the "what if" transitions between time-series clusters, using counterfactual explanations to reveal the minimal perturbations that shift a time-series from one cluster to another.
Pre-normalization in Transformers is the culprit behind the mysterious link between massive activation outliers and attention sinks, but decoupling them reveals their distinct functions: global parameterization vs. local attention modulation.
LLMs struggle more with restructuring solution spaces than refining constraints, revealing a key asymmetry in their reasoning abilities that standard benchmarks miss.
Surprisingly, a compact, training-free set of acoustic parameters rivals DNN embeddings and approaches self-supervised models in voice timbre attribute detection, offering interpretability and efficiency.
Robots can nimbly switch between autonomous and teleoperated modes based on the confidence of their learned perception, leading to more reliable manipulation.
Forget just deleting edges: XPlore uses gradients to intelligently tweak node features *and* add edges, unlocking more valid and faithful counterfactual explanations for GNNs.
Face pareidolia reveals that a vision model's behavior under ambiguity is governed more by representational choices than score thresholds, and that low uncertainty can signal either safe suppression or extreme over-interpretation.
TaxonRL doesn't just beat humans at bird identification; it shows its work, revealing a transparent reasoning process that could revolutionize how we trust AI in complex visual tasks.
LLMs can be harnessed to refine neural topic models, yielding substantial gains in topic quality and interpretability without sacrificing document representation accuracy.
L2 weight regularization unlocks stable and steerable sparse autoencoders, doubling steering success rates and aligning feature explanations with functional controllability.
Max-Plus networks, despite their interpretability, can be efficiently trained by exploiting the algebraic sparsity of their subgradients, leading to faster updates.
Static word embeddings like GloVe and Word2Vec can achieve surprisingly high accuracy (R^2 up to 0.87) in recovering geographic and temporal information, challenging the interpretation of similar findings in LLMs as evidence of complex world models.
A new additive classification model reveals that plaque texture, as assessed by ultrasound radiomics, is strongly associated with stroke risk, offering a non-invasive marker for improved patient stratification.
Demographic biases in brain MRI stem primarily from anatomical variations, not just acquisition-dependent contrast, challenging assumptions about bias mitigation strategies.
Agentic AI can actually *hurt* explanation quality for sophisticated "thinking" models analyzing physiological data, challenging the assumption that more complex reasoning always leads to better clinical insights.
Forget inspecting final outputs: LLMs telegraph their reward-hacking intentions internally, early in the generation process, via distinctive activation patterns.
Softmax attention heads specialize in stages during training, and a novel Bayes-softmax attention can achieve optimal prediction performance by reducing noise from irrelevant heads.
Finally, a forecasting model that's as accurate as the black boxes but actually tells you *why* it made that prediction.
Multimodal models are often blind at birth: a new "Visual Attention Score" reveals they struggle to focus on visual inputs during cold-start, but a simple attention-guided fix can boost performance by 7%.