Search papers, labs, and topics across Lattice.
96 papers published across 4 labs.
Fuzzy rules reveal how CLIP encodes domain-specific features in clinical reports and film reviews, offering a peek inside the black box of multimodal embeddings.
Forget static embeddings: this paper shows how modeling scientific concepts as evolving complex networks reveals surprising connections between conceptual change and network topology.
Locomotion policies, often considered black boxes, can autonomously learn interpretable phase structures and branching logic, revealing a hidden order in their decision-making.
Video diffusion transformers exhibit a hidden "magnitude hierarchy" in their activations that can be exploited for training-free quality improvements via a simple steering method.
LLMs don't just regurgitate token probabilities when expressing confidence; they perform a more sophisticated, cached self-evaluation of answer quality.
LLMs encode hierarchical semantic relations asymmetrically, with hypernymy being far more robust and redundantly represented than hyponymy.
Attention sinks aren't just a forward-pass phenomenon; they actively warp the training landscape by creating "gradient sinks" that drive massive activations.
People prefer XAI explanations that tell them *why* a feature change doesn't alter the outcome, not just *that* it doesn't.
MLLMs' image segmentation prowess isn't a given: a critical adapter layer actually *hurts* performance, forcing the LLM to recover via attention-mediated refinement.
Anomaly detection gets a dose of interpretability: SYRAN learns human-readable equations and flags anomalies as violations of the learned invariants.
Pinpointing the training data behind an LLM's behavior is now possible without retraining, opening the door to precise debugging and targeted interventions.
Acoustic and phonetic NACs encode accent in fundamentally different ways, with implications for how we interpret and manipulate these representations.
Control the emotional tone of generated speech without any training by directly manipulating specific neurons within large audio-language models.
Image editing models leak fascinating hints about their world knowledge through "edit spillover"—unintended changes to semantically related regions—and this paper turns that leakage into a probe.
CLIP struggles with fine-grained details in cross-domain few-shot learning, but a cycle-consistency method can fix its vision-language alignment and boost performance.
You can now audit multi-agent LLM systems and trace responsibility for harmful outputs even without access to internal execution logs, thanks to a clever "self-describing text" technique.
An AI model can estimate legal age from clavicle CT scans with higher accuracy than human experts, potentially revolutionizing forensic age assessment.
Unlock explainable outlier detection in foundation models with FoMo-X, a modular framework that adds negligible inference overhead while revealing interpretable risk tiers and calibrated confidence measures.
Standard PCA, despite its widespread use in CAD, struggles to directly reveal the original design parameters of a geometry, but this paper identifies conditions for accurate parameter estimation.
LLMs aren't monolithic black boxes: they contain spatially organized, functionally specialized modules that can be automatically discovered.
Forget one-hot encodings: conditioning timbre VAEs on continuous perceptual features unlocks more compact and controllable latent spaces.
Achieve expert-level accuracy in wasp identification with a YOLO-based model that also shows *why* it makes its classifications, thanks to integrated HiResCAM explainability.
Transformers have a hidden symmetry: depth-wise residuals are secretly doing the same thing as sequence-wise sliding window attention, unlocking new architectural insights.
LLMs often fail to update their final predictions after interventions on intermediate reasoning steps, suggesting that these structures function more as influential context than stable causal mediators.
Fuzzy logic and deep learning join forces to make radio astronomy ML pipelines less of a black box.
Object hallucinations in LVLMs aren't just a language problem—abnormal visual attention patterns are also to blame, and can be fixed without retraining.
By strategically exploiting LLMs' inconsistent cross-lingual performance, this work offers a surprisingly scalable way to pinpoint the specific experts responsible for storing and retrieving factual knowledge.
Transformers trained on a simple grid-world learn hidden representations that directly reflect the underlying predictive geometry, offering a glimpse into how neural networks internalize structural constraints.
LRMs can often recover from injected errors in their reasoning steps, revealing a hidden "critique" ability that can be harnessed to improve performance without additional training.
An interpretable machine learning framework leveraging XGBoost and DeepSeek reveals key genetic factors driving drug response in lung cancer, offering a path towards personalized treatment strategies.
Escape the flatland of traditional recommender systems: RecBundle uses differential geometry to disentangle user interactions from preferences, opening the door to understanding and mitigating systemic biases.
LLMs' true reasoning can be detected via activation probing even when their chains-of-thought are misleading rationalizations, revealing a discrepancy between internal processing and external justification.
A tabular LLM, TAP-GPT, rivals state-of-the-art general-purpose LLMs in few-shot Alzheimer's prediction while offering interpretable reasoning and robustness to missing data, opening the door to more transparent and reliable clinical AI.
Distributional counterfactual explanations are now possible for black-box tabular models, thanks to a novel sparse search algorithm that sidesteps the need for gradients.
Adversarial representation learning can improve the out-of-distribution generalization of age predictors, but don't mistake correlation for causation.
Injecting data-derived spectral priors into neural network initialization can dramatically accelerate convergence and improve the efficiency of function-parameterizing architectures.
Standard upsampling methods in XAI systematically corrupt attribution signals, but a novel semantic-aware redistribution approach provably preserves attribution mass and improves explanation faithfulness.
LLMs can now write better quantitative trading algorithms than humans, thanks to a new framework that turns unstructured financial reports into executable code.
Even when multimodal LLMs get face verification right, their explanations are often wrong, relying on hallucinated facial attributes.
Forget blindly pruning LLMs: this work shows you can use Sparse Autoencoders to identify and protect the most functionally important components during compression, leading to more robust models.
Forget fine-tuning: steer a 35B MoE's agency on the fly with SAE-decoded vectors, revealing a surprisingly simple, one-dimensional control knob.
LLMs' "Aha!" moments aren't about magic tokens, but about explicitly verbalizing and managing uncertainty during reasoning, which drives performance.
Unlock the secrets hidden within LoRA weights: a novel method reveals that these weights already encode adapter behavior and performance, enabling accurate predictions without running the base model or accessing training data.
LLMs dissect tables in three distinct attention phases: broad scanning, cell localization, and contribution amplification.
Forget holdout data for feature effect estimation: training data's larger sample size usually wins, and cross-validation can further reduce model variance.
Forget hand-crafted features: this system uses an LLM to automatically discover features from event sequences that outperform even state-of-the-art embeddings by up to 5.8%.
Uncover hidden biases and track evolving viewpoints: POLAR reveals individual-level associations in text data that are masked by traditional aggregate analyses.
LLMs exhibit a surprising degree of moral indifference, compressing distinct moral concepts into uniform probability distributions, a problem that persists across model scales, architectures, and alignment techniques.
Questioning the common practice of interpreting data through a single model class, this work shows that alternative, equally well-performing models exist across multiple model classes and hyperparameter settings.
Forget persistent homology's computational cost: Euler Characteristic Surfaces unlock 98% accuracy in ECG classification with linear complexity, rivaling deep learning while staying interpretable.
Forget iterative optimization – this method synthesizes adversarial patches for facial re-ID in a single forward pass, dropping mAP from 90% to near zero.
Infant motor learning reveals a sharp phase transition in control strategy arbitration, governed by context window size and predictable via a closed-form exponential moving average.
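As a rough illustration of the closed-form arbitration rule this entry describes (a minimal sketch in Python; the binary-strategy setup, variable names, and the window-to-smoothing mapping are assumptions, not the paper's notation):

```python
import numpy as np

def ema(x, window):
    """Exponential moving average; alpha = 2 / (W + 1) is one common
    convention for mapping a context window size W to the smoothing factor."""
    alpha = 2.0 / (window + 1)
    s = np.empty(len(x))
    s[0] = x[0]
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s

# Hypothetical arbitration: switch control strategies once the smoothed
# prediction error crosses a threshold.
rng = np.random.default_rng(0)
prediction_error = np.abs(rng.normal(size=200)).cumsum() / 50
use_strategy_b = ema(prediction_error, window=20) > 1.0
```

A larger `window` lowers `alpha`, so the arbitration signal reacts more slowly to raw errors, which is the context-window dependence the entry highlights.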
Stylometric features, combined with modern multilingual language models, significantly boost the performance of machine-generated text detection, often surpassing language-specific models.
Debugging multi-agent systems just got easier: AgentTrace pinpoints root causes of failures with high accuracy and speed, without needing costly LLM inference during debugging.
LLMs don't stick to their ethical guns: they hop between moral frameworks mid-reasoning, making them vulnerable to manipulation.
Algorithmic metrics for counterfactual explanations? Turns out humans don't really agree with them.
Unlock robust feature importance analysis with `xplainfi`, an R package that fills critical gaps by offering conditional importance methods and statistical inference for diverse ML models.
KANs become far more robust and interpretable with in-context symbolic regression, achieving near-perfect error reduction in hyperparameter sweeps.
By blending counterfactual and feature attribution methods, GradCFA generates more realistic and diverse explanations, offering a richer understanding of neural network decisions than either approach alone.
XGBoost models can be debiased for gender fairness in critical healthcare settings with minimal performance loss using a novel multi-metric Bayesian optimization approach.
LLMs' true power lies in the "unexplainable" – capabilities that exceed rule-based systems, challenging the pursuit of full interpretability.
LLMs can withstand 3,000 sequential knowledge edits without catastrophic forgetting, thanks to a new sparse editing framework that surgically manipulates knowledge circuits.
VLMs' hallucinations aren't just errors, but traceable pathologies in their "cognitive trajectory," diagnosable via geometric anomalies in a learned state space.
Forget prompt engineering – surgically altering a model's internal activations can jailbreak it, exposing vulnerabilities even when the input looks harmless.
LLMs can now offer globally contestable decision support by systematically mapping decision spaces into argumentation frameworks, allowing users to challenge the underlying logic, not just individual outputs.
Uncover the hidden dynamics of your RL agent with a new visualization framework that reveals how TD errors sculpt the optimization landscape and drive policy updates.
LLMs can automatically translate complex access control rules into plain English, making security policies understandable to non-experts.
Visualizing the critic's loss landscape reveals distinct characteristics linked to stable vs. unstable learning in online RL, offering a new window into algorithm dynamics.
Surprisingly, video diffusion models contain recoverable physics-related cues in their intermediate denoising representations, enabling more physically plausible video generation with reduced computational cost.
ResNet50 is shown to leak semantic attributes into its null space, while DinoViT better preserves class semantics, revealing critical differences in how these architectures handle semantic invariants.
Even with weaker assumptions, ICA post-processing can unlock state-of-the-art disentanglement from vanilla autoencoders and foundation model-scale MAEs.
Single-cell foundation models exhibit surprising annotation bias, with 40% of highly connected features lacking biological annotation, suggesting current interpretability methods may be systematically skewed.
Unlocking interpretable AI just got easier: HyperExpress disentangles image concepts into composable parts using hyperbolic space, letting you reconstruct visuals from their semantic building blocks.
Despite the intuition that noisy environments should make models rely more on visual cues, AVSR models stubbornly cling to audio, even when it's heavily degraded.
XAI can boost trust in fake news detection by revealing which words sway the model, but choosing the right XAI method (SHAP, LIME, or Integrated Gradients) matters for performance and interpretability.
Forget black-box anomaly detection: this neuro-symbolic VLM agent uses natural language descriptions and visual grounding to explain *why* an event occurred in multivariate time series data, even with little training.
Neuromodulation offers a way to disentangle global contextual parameters from local manifold representations in constrained autoencoders, enabling context-aware dimensionality reduction.
Fine-tuning unlocks LLMs' surprising ability to predict how memorable a sentence is and how long it takes to read, exceeding traditional methods.
Ditch the concatenation: a new neural dependence estimator sidesteps MINE's computational baggage, offering a more stable and efficient way to analyze autoencoder features.
Uncover the surprising locations of demographic biases within CLIP's vision encoder by pinpointing specific attention heads responsible for encoding gender and age stereotypes.
Unlock precise, training-free color control in text-to-image models by directly manipulating the latent space's emergent Hue, Saturation, and Lightness structure.
Stop wrestling with opaque expression models: ELISA lets you directly translate single-cell RNA sequencing data into mechanistic biological hypotheses using an interpretable hybrid generative AI agent.
Concept erasure in text-to-image models no longer needs to be a blunt instrument: OrthoEraser precisely removes harmful content while preserving image quality by analytically orthogonalizing the erasure process.
Softmax attention's normalization creates unavoidable "attention sinks" when implementing trigger-conditional logic, but ReLU attention offers a sink-free alternative.
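A minimal sketch of the contrast this entry points at, in NumPy; the unnormalized ReLU variant shown is a simplification, not the paper's exact formulation:

```python
import numpy as np

def softmax_attention(scores):
    # Each row is forced to sum to 1, so a query that "wants" to attend
    # to nothing must still dump its probability mass somewhere -- in
    # practice, often onto a designated sink token.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def relu_attention(scores):
    # No sum-to-one constraint: a row can be all zeros, so trigger-
    # conditional logic can simply switch attention off without a sink.
    return np.maximum(scores, 0.0)
```

The normalization constraint is exactly what makes sinks "unavoidable" under softmax: zero total attention is not an expressible output.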
Robots can boost their perceived competence by 83% simply by tweaking navigation behaviors suggested by a causal Bayesian network.
Achieving fairness doesn't just mean equal outcomes—this work shows how to enforce consistent reasoning across groups by penalizing disparities in counterfactual explanations.
Uncover hidden backdoors in your neural networks by tracing the active paths that malicious triggers exploit.
Forget subjective human evaluations: this paper uses a clever knowledge distillation trick to objectively rank XAI methods for NMT, revealing that attention-based attributions beat gradient-based ones.
Uncover the hidden causal chains inside your LLM with Causal Concept Graphs, which outperform existing methods for reasoning by explicitly modeling concept dependencies.
Speech deepfake detection gets a reasoning upgrade: HIR-SDD uses chain-of-thought prompting with Large Audio Language Models to not only detect fakes but also explain *why* it thinks they're fake.
Clinicians using HeartAgent, a cardiology-specific agent system, improved diagnostic accuracy by 26.9% and explanatory quality by 22.7% compared to unaided experts.
Forget fine-tuning: surprisingly, single-neuron activations in VLMs can be directly probed to create classifiers that outperform the full model, with 5x speedups.
Chinese metaphor identification is far more sensitive to the choice of protocol than to model-level variations, yet it can be tackled with fully transparent, LLM-assisted rule scripts.
Prompt highlighting in LLMs gets a serious upgrade: PRISM-$\Delta$ steers models to focus on relevant text spans with better accuracy and fluency, even in long contexts.
Fair-Gate disentangles speaker identity and sex in voice biometrics, boosting fairness without sacrificing accuracy by explicitly routing features through identity and sex-specific pathways.
LLMs possess a "word recovery" mechanism that allows them to reconstruct canonical word-level tokens from character-level inputs, explaining their surprising robustness to non-canonical tokenization.