35 papers published across 3 labs.
Multimodal deep learning models for cancer prognosis may not be synergizing information across modalities as much as we think; better performance seems to come from simply adding complementary signals.
Don't waste compute on unreliable explanations: epistemic uncertainty can predict when XAI methods will fail, allowing you to gate their use.
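A minimal sketch of the gating idea (an illustration of the general recipe, not this paper's method): estimate epistemic uncertainty with MC-dropout and skip the saliency computation when predictive entropy is too high. The model, threshold, sample count, and single-example batch are all placeholder assumptions.

```python
import torch

def predictive_entropy(model, x, n_samples=20):
    """Epistemic-uncertainty proxy via MC-dropout: entropy of the mean predictive distribution."""
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)]).mean(dim=0)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

def gated_saliency(model, x, threshold=0.5):
    """Only produce a gradient saliency map when uncertainty is below the gate (assumes x is a single-example batch)."""
    if predictive_entropy(model, x).item() > threshold:
        return None  # explanation deemed unreliable; don't spend compute on it
    model.eval()
    x = x.clone().requires_grad_(True)
    model(x).max(dim=-1).values.sum().backward()
    return x.grad.abs()
```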
Quantum chemistry's density matrix approach reveals interpretable early warning signals of phase transitions in deep learning, from grokking to emergent misalignment.
LLMs spontaneously organize into brain-like functional units where the whole is greater than the sum of its parts, and destroying these synergistic cores cripples reasoning.
Achieve near-perfect success (98%+) in real-time causal diagnostics for smart manufacturing with a neurosymbolic multi-agent copilot, demonstrating the viability of interpretable AI in complex industrial settings.
Stop guessing which layers to edit in your LLM – KEditVis reveals the inner workings of knowledge editing, letting you pinpoint the most effective interventions.
Uncover hidden conceptual gaps in your AI: "concept frustration" reveals when your model's internal reasoning clashes with human understanding, paving the way for safer, more interpretable AI.
Interactive narrative maps with semantic interaction significantly boost insight generation compared to static maps and timelines, offering a more intuitive path to model refinement.
Forget IoU, measuring the structural compactness of attribution maps with Minimum Spanning Trees reveals fundamental differences in how models explain themselves.
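A rough sketch of one way to score structural compactness (an assumed illustration; the paper's exact metric may differ): take the top-k attributed pixels and measure the total edge length of the minimum spanning tree over their coordinates, where smaller totals mean spatially compact evidence and larger totals mean scattered evidence.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_compactness(attribution, top_k=50):
    """Total MST edge length over the coordinates of the top-k attributed pixels."""
    flat = attribution.ravel()
    idx = np.argsort(flat)[-top_k:]                    # indices of the k strongest attributions
    coords = np.stack(np.unravel_index(idx, attribution.shape), axis=1).astype(float)
    dists = squareform(pdist(coords))                  # pairwise Euclidean distances between pixels
    return minimum_spanning_tree(dists).sum()

# A tight blob of attribution vs. the same mass scattered as noise
rng = np.random.default_rng(0)
blob = np.zeros((64, 64)); blob[28:36, 28:36] = rng.random((8, 8))
noise = rng.random((64, 64))
print(mst_compactness(blob), mst_compactness(noise))   # the blob yields a much smaller total
```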
Grokking isn't just about local circuits or optimization tricks, but a global structural collapse of redundant model manifolds, revealing a deep connection between compression and generalization.
Forget painstakingly reverse-engineering individual models; this work offers a way to tell if two different neural nets are secretly running the same algorithm under the hood.
Not all LVLMs that glitter are gold: a new information-theoretic analysis reveals that some lean heavily on language priors while others genuinely fuse vision and language.
LLMs ace linguistic benchmarks, but a token-level perplexity analysis reveals they're often relying on the wrong cues.
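For a sense of what a token-level perplexity analysis involves, here is a minimal sketch using Hugging Face transformers; the model name is a placeholder and the scoring is generic surprisal, not the paper's protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def token_surprisals(text):
    """Per-token negative log-likelihood (nats); exp of the mean is the sentence perplexity."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = logits[:, :-1, :].log_softmax(dim=-1)            # predict token t+1 from the prefix up to t
    nll = -logprobs.gather(2, ids[:, 1:, None]).squeeze(-1)     # NLL of each actual next token
    return list(zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), nll[0].tolist()))

for token, s in token_surprisals("The cat sat on the mat."):
    print(f"{token:>12s}  {s:.2f}")
```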
Get faithful and robust explanations for random subspace methods – a cornerstone of defense against adversarial attacks – without sacrificing computational efficiency.
Trust in tree ensembles hinges on rigorous explanations, and this paper delivers a method to generate them.
Comparing the intrinsic geometry of neural nets reveals subtle distinctions missed by standard methods, offering a new lens into how networks actually compute.
Sparse autoencoders' failure to generalize compositionally isn't due to amortized inference, but because they learn lousy dictionaries in the first place.
XAI's persistent failures aren't due to a lack of ground truth, but a failure to recognize that ground truth *is* the underlying causal model.
Ventricular dysfunction can be surprisingly well-predicted in a zero-shot manner from ECG diagnostic probabilities, suggesting a structured encoding of cardiac function within these representations.
By baking in tumor physics, PhysNet doesn't just beat standard deep learning models on medical image classification, it also learns interpretable biophysical parameters of tumor growth.
Scientific reasoning gains from prompt engineering are often mirages, driven by model-specific hacks that don't generalize.
Forget hand-crafting prototypes for interpretable RL: this method learns them directly from the data, matching the performance of expert-designed systems.
LLMs can strategically obfuscate their reasoning, with chain-of-thought monitorability dropping by up to 30% under stress tests, particularly when tasks don't demand explicit reasoning.
Users often dangerously misunderstand the true scope of authority they've granted to computer-use agents, even while recognizing abstract risks.
Flow-based generative models disentangle concepts naturally during a pivotal "Instantiation Stage," offering a sweet spot for targeted manipulation.
CNNs can be made more interpretable without sacrificing too much accuracy by swapping the final layer for k-means and visualizing activations from multiple earlier blocks.
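A toy sketch of the head-swap idea (illustrative assumptions throughout, not the paper's pipeline): cluster penultimate-layer features with k-means and assign each cluster the majority label of the training points it contains.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_head(features, labels, n_clusters=10, seed=0):
    """Replace a learned classifier head with k-means over penultimate-layer features."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(features)
    cluster_to_label = np.array([
        np.bincount(labels[km.labels_ == c]).argmax() for c in range(n_clusters)
    ])  # majority label per cluster
    predict = lambda feats: cluster_to_label[km.predict(feats)]
    return km, predict

# Toy usage with random "features"; in practice these come from the CNN's penultimate layer
rng = np.random.default_rng(0)
feats, labs = rng.normal(size=(500, 64)), rng.integers(0, 10, size=500)
km, predict = kmeans_head(feats, labs)
print((predict(feats) == labs).mean())
```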
Unlock CLIP's black box: EZPC reveals the "why" behind zero-shot image recognition by grounding predictions in human-understandable concepts, without sacrificing accuracy.
LLMs are surprisingly bad at reasoning about everyday scenarios, consistently choosing nonsensical actions (like walking to a car wash) because they're overly influenced by simple heuristics like distance, even when doing so violates obvious constraints.
LLMs exhibit categorical perception-like warping in their hidden state representations at digit-count boundaries, even without explicit semantic category knowledge, revealing a surprising sensitivity to structural input discontinuities.
LLMs can be confidently wrong about *why* they succeed, and accurately explain failures they can't fix, revealing a fundamental disconnect between explanation and competence.
You can now pinpoint the network traffic features most responsible for triggering anomaly detection, thanks to SHAP-guided ensemble learning.
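A minimal sketch of SHAP-guided feature ranking on a tree ensemble (synthetic placeholder features and labels, not the paper's dataset or model): rank traffic features by mean absolute SHAP value.

```python
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder flow features; a real pipeline would use traffic statistics (duration, byte counts, packet rates, ...)
rng = np.random.default_rng(0)
feature_names = ["duration", "src_bytes", "dst_bytes", "pkt_rate", "syn_ratio"]
X = rng.normal(size=(1000, len(feature_names)))
y = (X[:, 3] + 0.5 * X[:, 4] > 1).astype(int)   # synthetic "anomaly" label driven by pkt_rate and syn_ratio

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
shap_vals = shap.TreeExplainer(clf).shap_values(X)   # one row per flow, one column per feature
ranking = np.argsort(np.abs(shap_vals).mean(axis=0))[::-1]
for i in ranking:
    print(f"{feature_names[i]:>10s}  mean |SHAP| = {np.abs(shap_vals[:, i]).mean():.3f}")
```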
Smart contract vulnerability detection gets a 39% accuracy boost and adversarial robustness with ORACAL, a framework that uses RAG-enhanced LLMs to inject expert security context into heterogeneous graphs.
Evolving interpretable composite features via Genetic Programming beats black-box deep learning at music tagging, revealing synergistic interactions and transformations that boost performance.
Learning interpretable safety rules from noisy, real-world data is now possible, outperforming purely neural or simpler neuro-symbolic approaches by a large margin.
Over-refusal isn't just a misapplication of a global "no" switch; it's deeply intertwined with how LLMs represent and execute specific tasks.