Interpretability & Mechanistic Interpretability
Understanding the internal mechanisms of neural networks through circuit analysis, feature visualization, and mechanistic interpretability.
Recent Papers
This paper introduces a model-hardware co-design framework for CNN-based SAR ATR that jointly optimizes adversarial robustness, model compression, and FPGA accelerator design. The framework uses hardware-guided structured pruning, informed by a hardware performance model, to explore robustness-efficiency trade-offs. Experiments on MSTAR and FUSAR-Ship datasets show the framework produces models up to 18.3x smaller with 3.1x fewer MACs while preserving robustness, and the FPGA implementation achieves significant latency and energy efficiency improvements compared to CPU/GPU baselines.
Develops a model-hardware co-design framework that unifies robustness-aware model compression and FPGA accelerator design for CNN-based SAR ATR, enabling exploration of robustness-efficiency trade-offs.
This paper introduces a calibrated Bayesian deep learning framework for medical imaging decision support, addressing the critical need for reliable uncertainty quantification in AI-assisted diagnostics. The framework combines a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) during training, which penalizes high-confidence errors and low-confidence correct predictions, with a post-hoc Dual Temperature Scaling (DTS) strategy to refine the posterior distribution. Validated on pneumonia screening, diabetic retinopathy detection, and skin lesion identification, the approach demonstrates improved calibration, robust performance in data-scarce scenarios, and effectiveness on imbalanced datasets.
Introduces a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) and Dual Temperature Scaling (DTS) strategy to improve calibration and uncertainty quantification in Bayesian deep learning models for medical imaging.
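The paper's exact formulation of CUB-Loss is not reproduced here, but a minimal PyTorch sketch of a confidence-uncertainty boundary penalty of this kind might look as follows; the margin value, the ReLU-hinge form, and the weighting are assumptions, not the authors' definition.

```python
import torch
import torch.nn.functional as F

def cub_style_loss(logits, targets, lambda_cub=0.5, margin=0.7):
    """Cross-entropy plus a hedged sketch of a confidence-uncertainty
    boundary penalty: confident errors and under-confident correct
    predictions are both penalized. Margin and weighting are assumptions."""
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)                 # predicted-class confidence
    correct = pred.eq(targets).float()
    # Penalize confidence above `margin` on wrong predictions ...
    overconfident_error = (1.0 - correct) * F.relu(conf - margin)
    # ... and confidence below `margin` on correct predictions.
    underconfident_hit = correct * F.relu(margin - conf)
    cub = (overconfident_error + underconfident_hit).mean()
    return ce + lambda_cub * cub

# Usage: loss = cub_style_loss(model(x), y); loss.backward()
```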
The paper introduces ModelWisdom, a toolkit designed to enhance the interpretability and usability of TLA+ model checking by addressing challenges in counterexample analysis and model repair. ModelWisdom integrates visualization techniques, graph optimization, LLM-based summarization, and automated repair suggestions to improve the debugging process. The toolkit's capabilities, including colorized violation highlighting, graph folding, and LLM-powered explanations, facilitate a more interactive and understandable workflow for TLA+ specifications.
Introduces an interactive environment, ModelWisdom, that leverages visualization and large language models to improve the interpretability and actionability of TLA+ model checking.
The paper introduces a neuromodulation-inspired parameter-efficient fine-tuning (PEFT) method that adapts large pretrained models by learning per-neuron thresholds and gains in activation space. This approach changes the mode of computation by selecting and rescaling existing computations rather than rewriting weights, offering improved interpretability. Experiments on MNIST and rotated MNIST show the method improves accuracy over a frozen baseline with significantly fewer trainable parameters than LoRA, while also enabling neuron-level attribution and conditional computation.
Introduces a parameter-efficient fine-tuning method that learns per-neuron thresholds and gains in activation space to adapt pretrained models by changing the mode of computation.
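As a rough illustration of the mechanism, a per-neuron gain-and-threshold adapter over a frozen layer's activations could be as small as the sketch below; the soft-threshold gate is an illustrative choice and not necessarily the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class NeuromodulationAdapter(nn.Module):
    """Wraps a frozen layer's activations with learnable per-neuron gains and
    thresholds; only these two vectors are trained. The ReLU soft-threshold
    form is an assumption."""
    def __init__(self, n_neurons):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(n_neurons))
        self.threshold = nn.Parameter(torch.zeros(n_neurons))

    def forward(self, act):
        # Pass (rescaled) activity only where it clears the learned threshold.
        return self.gain * torch.relu(act - self.threshold)

# Usage sketch: h = frozen_linear(x); h = adapter(h)
# where only `adapter` parameters require gradients.
```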
This paper introduces Hierarchical Sparse Autoencoders (HSAEs) to explicitly model the hierarchical relationships between features extracted from LLMs, addressing the limitation of standard SAEs that treat features in isolation. HSAEs incorporate a structural constraint loss and random feature perturbation to encourage alignment between parent and child features in the learned hierarchy. Experiments across various LLMs and layers demonstrate that HSAEs recover semantically meaningful hierarchies while preserving reconstruction fidelity and interpretability.
Introduces Hierarchical Sparse Autoencoders (HSAEs) to learn and represent the hierarchical relationships between features extracted from LLMs.
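A two-level sketch of the idea is shown below: parent and child dictionaries reconstruct the activation jointly, and a structural term nudges each child decoder direction toward its parent's. The cosine-alignment penalty and the fixed parent-child fan-out are assumptions standing in for the paper's structural constraint loss and perturbation scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalSAESketch(nn.Module):
    """Illustrative two-level sparse autoencoder: `n_parent` coarse features,
    each with `k_child` children sharing its decoder direction approximately."""
    def __init__(self, d_model, n_parent=512, k_child=4):
        super().__init__()
        self.k_child = k_child
        self.enc_p = nn.Linear(d_model, n_parent)
        self.enc_c = nn.Linear(d_model, n_parent * k_child)
        self.dec_p = nn.Linear(n_parent, d_model, bias=False)
        self.dec_c = nn.Linear(n_parent * k_child, d_model, bias=False)

    def forward(self, x, l1=1e-3, struct=1e-2):
        zp, zc = F.relu(self.enc_p(x)), F.relu(self.enc_c(x))
        recon = self.dec_p(zp) + self.dec_c(zc)
        loss = F.mse_loss(recon, x) + l1 * (zp.abs().mean() + zc.abs().mean())
        # Structural constraint: each child decoder row should point roughly
        # where its parent's decoder row points.
        Wp = self.dec_p.weight.T                      # (n_parent, d_model)
        Wc = self.dec_c.weight.T                      # (n_parent*k_child, d_model)
        parent_of_child = Wp.repeat_interleave(self.k_child, dim=0)
        align = 1 - F.cosine_similarity(Wc, parent_of_child, dim=-1).mean()
        return recon, loss + struct * align
```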
This paper extends crosscoder model diffing to cross-architecture comparisons, enabling the unsupervised discovery of behavioral differences between LLMs with different architectures. They introduce Dedicated Feature Crosscoders (DFCs), an architectural modification to improve the isolation of unique features in one model compared to another. Applying this technique, they identify features such as CCP alignment in Qwen3-8B and Deepseek-R1-0528-Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and a copyright refusal mechanism in GPT-OSS-20B.
Introduces Dedicated Feature Crosscoders (DFCs), an architectural modification to enhance crosscoder model diffing for isolating features unique to individual models in cross-architecture comparisons.
The paper introduces ProtoMech, a framework for mechanistic interpretability of protein language models (pLMs) that uses cross-layer transcoders to learn sparse latent representations capturing the model's full computational circuitry. By jointly analyzing representations across layers of ESM2, ProtoMech identifies compressed circuits that retain significant performance on protein family classification and function prediction while using only a small fraction of the latent space. Steering along these identified circuits enables high-fitness protein design, demonstrating the framework's utility in understanding and manipulating pLM behavior.
Introduces ProtoMech, a novel framework that discovers computational circuits in protein language models by learning sparse, cross-layer latent representations.
The paper introduces the Prototype Transformer (ProtoT), an autoregressive language model architecture that uses prototypes (parameter vectors) instead of self-attention to improve interpretability. ProtoT establishes two-way communication between the input sequence and the prototypes, causing the prototypes to capture nameable concepts during training and creating interpretable communication channels. Experiments demonstrate that ProtoT scales linearly with sequence length, performs well on text generation and downstream tasks (GLUE), and exhibits robustness to input perturbations while providing interpretable pathways for understanding robustness and sensitivity.
Introduces the Prototype Transformer, a novel autoregressive language model architecture designed for interpretability by using prototypes to capture nameable concepts and create interpretable communication channels.
The paper investigates how reasoning behaviors in LLMs influence reasoning quality by analyzing behavioral patterns in model responses. They find that injecting specific reasoning behavior patterns can significantly improve reasoning outcomes. Based on this, they propose two parameter-free optimization methods, InjectCorrect (imitating patterns from past correct answers) and InjectRLOpt (using a learned value function to generate behavior injectants), to steer the reasoning process.
Introduces InjectRBP, a novel framework for steering LLM reasoning by structurally injecting observed behavioral patterns, without requiring parameter updates.
The paper introduces a rule-based computational model for Gàidhlig (Scottish Gaelic) morphology, addressing the challenge of limited data availability for low-resource languages that hinders the application of neural models. The model leverages data from Wiktionary and uses SQL queries to identify lexical patterns, constructing a declarative rule-base for generating inflected word forms via Python utilities. This approach demonstrates that rule-based systems can effectively utilize limited data while providing interpretability and supporting the development of educational tools.
Presents a functional rule-based system for Gàidhlig morphology using Wiktionary data and SQL queries to generate inflected word forms.
The paper introduces V-SHiNE, a browser-based virtual smart home environment designed to facilitate the evaluation of explainable AI (XAI) methods in the context of smart home automation. V-SHiNE enables researchers to configure realistic smart home environments, simulate user behaviors, integrate custom explanation engines, and log user interactions. A user study with 159 participants demonstrates the framework's utility for assessing the impact and quality of different explanation strategies.
Introduces V-SHiNE, a novel browser-based simulation framework, to enable scalable and reproducible evaluation of XAI methods within virtual smart home environments.
The paper investigates whether neural world models truly learn physical laws or rely on statistical shortcuts, particularly under out-of-distribution shifts. They introduce PhyIP, a non-invasive evaluation protocol that assesses the linear decodability of physical quantities from frozen latent representations, contrasting it with adaptation-based methods. Their results show that when self-supervised learning achieves low error, latent physical structures are linearly accessible and robust to OOD shifts, while adaptation-based evaluations can collapse this structure, suggesting that non-invasive probes are more accurate for evaluating physical world models.
Introduces PhyIP, a non-invasive evaluation protocol, to accurately assess the linear accessibility of physical quantities in frozen latent representations of world models, demonstrating its superiority over adaptation-based methods.
The paper introduces Distribution Map (DMAP), a novel method for representing text using next-token probability distributions from LLMs by mapping text to samples in the unit interval that encode rank and probability. DMAP addresses the limitations of perplexity by accounting for context and the shape of the conditional distribution. The authors demonstrate DMAP's utility in validating generation parameters, detecting machine-generated text via probability curvature, and performing forensic analysis of models fine-tuned on synthetic data.
Introduces DMAP, a mathematically grounded method for representing text as a distribution of samples in the unit interval based on next-token probability distributions from LLMs, enabling efficient and model-agnostic text analysis.
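One plausible way to realize such a mapping, given precomputed next-token logits, is sketched below: each observed token is placed in the unit interval according to the probability mass of strictly more likely tokens plus half its own mass, which encodes both rank and probability. This specific encoding is an assumption, not DMAP's published formula.

```python
import torch

def dmap_style_scores(logits, token_ids):
    """Map each observed token to a point in (0, 1) reflecting its rank and
    probability under the model's conditional next-token distribution.

    logits:    (seq_len, vocab) next-token logits
    token_ids: (seq_len,) the tokens actually observed at each position
    """
    probs = torch.softmax(logits, dim=-1)
    p_obs = probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    mass_above = (probs * (probs > p_obs.unsqueeze(-1))).sum(-1)
    return 1.0 - (mass_above + 0.5 * p_obs)   # values near 1 = top-ranked tokens
```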
This paper investigates the problem of unstable feature importance estimates in expressive machine learning models, which hinders their use in scientific discovery. The authors theoretically analyze the bias-variance tradeoff in aggregating feature importance estimates, demonstrating that ensembling at the model level yields more accurate estimates by reducing excess risk. They empirically validate their theoretical findings on benchmark datasets and a large-scale proteomic study from the UK Biobank.
Demonstrates theoretically and empirically that ensembling at the model level, rather than aggregating individual model explanations, provides more accurate feature importance estimates, especially for expressive models.
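The contrast the paper draws can be reproduced with scikit-learn: compute permutation importance on the ensemble's averaged prediction rather than averaging per-model importance scores. The toy data, model choice, and scorer below are placeholders for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

models = [GradientBoostingRegressor(random_state=s).fit(X, y) for s in range(5)]

class MeanEnsemble:
    """Ensemble-level model: importance is computed on the averaged prediction."""
    def __init__(self, models): self.models = models
    def fit(self, X, y): return self          # members are already fitted
    def predict(self, X):
        return np.mean([m.predict(X) for m in self.models], axis=0)

# (a) aggregate explanations of individual models (higher variance per the paper)
per_model_imp = np.mean(
    [permutation_importance(m, X, y, scoring="r2", n_repeats=10,
                            random_state=0).importances_mean for m in models],
    axis=0)

# (b) explain the ensemble itself (the model-level aggregation the paper favors)
ensemble_imp = permutation_importance(MeanEnsemble(models), X, y, scoring="r2",
                                      n_repeats=10, random_state=0).importances_mean
```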
This paper investigates the internal representations of high-level musical concepts within audio diffusion models using activation patching, revealing that a small subset of attention layers controls distinct semantic concepts. They then use Contrastive Activation Addition and Sparse Autoencoders in these key layers to achieve more precise control over audio generation. The authors demonstrate the ability to manipulate specific musical elements like tempo and mood by steering activations in the identified layers.
Demonstrates precise control over generated audio by identifying and steering activations in specific attention layers of audio diffusion models.
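A generic activation-addition hook of the kind used for this sort of steering is sketched below; the layer index, the steering vector (typically a difference of mean activations over two contrasting prompt sets), and the attribute path into the diffusion model are placeholders, not the paper's implementation.

```python
import torch

def make_steering_hook(steer_vec, alpha=4.0):
    """Forward hook that adds a steering direction to a layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steer_vec.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage (model.blocks[k] is a placeholder attribute path):
# handle = model.blocks[k].register_forward_hook(make_steering_hook(steer_vec))
# ... run generation ...
# handle.remove()
```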
This paper introduces SpaTeoGL, a spatiotemporal graph learning framework that constructs window-level spatial graphs of iEEG electrode interactions and a temporal graph linking time windows based on spatial graph similarity. The method uses a smooth graph signal processing formulation solved via alternating block coordinate descent, providing convergence guarantees. Experiments on a multicenter iEEG dataset demonstrate that SpaTeoGL achieves competitive SOZ localization performance compared to horizontal visibility graphs and logistic regression, while also enhancing non-SOZ identification and offering interpretable insights into seizure dynamics.
Introduces a novel spatiotemporal graph learning framework, SpaTeoGL, to model and interpret seizure onset zone dynamics from iEEG data.
This paper investigates jailbreaking attacks on LLMs by analyzing differences in internal representations between jailbreak and benign prompts across multiple open-source models (GPT-J, LLaMA, Mistral, Mamba). They propose a tensor-based latent representation framework to capture structure in hidden activations, enabling jailbreak detection without fine-tuning or auxiliary LLMs. By selectively bypassing high-susceptibility layers in LLaMA-3.1-8B, the method blocks 78% of jailbreak attempts while preserving 94% of benign behavior, demonstrating the potential for inference-time interventions.
Introduces a tensor-based latent representation framework for detecting and disrupting jailbreak attacks by analyzing and manipulating internal activations of LLMs at inference time.
The paper introduces SafeNeuron, a neuron-level safety alignment framework for LLMs designed to improve robustness against neuron-level attacks. It identifies and freezes safety-related neurons during preference optimization, forcing the model to develop redundant safety representations across the network. Experiments show SafeNeuron enhances robustness against neuron pruning attacks, mitigates the risk of models being used for red-teaming, and maintains general capabilities, while also revealing stable and shared internal safety representations.
Introduces SafeNeuron, a novel neuron-level safety alignment framework that enhances LLM robustness by redistributing safety representations across the network.
This paper introduces an attribution-guided query rewriting method to improve the robustness of neural retrievers when faced with underspecified or ambiguous queries. The approach computes gradient-based token attributions from the retriever to identify problematic query components and then uses these attributions to guide an LLM in rewriting the query. Experiments on BEIR collections demonstrate that this method consistently improves retrieval effectiveness compared to existing query rewriting and explainability-based techniques, especially for implicit or ambiguous information needs.
Introduces an attribution-guided query rewriting framework that leverages retriever feedback to improve query clarity and retrieval effectiveness.
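A minimal gradient-times-embedding attribution for a bi-encoder retriever, of the kind that could feed such a rewriting prompt, is sketched below; `query_emb_fn` and the pooled-embedding interface are hypothetical placeholders, not the paper's retriever API.

```python
import torch

def token_attributions(query_emb_fn, query_token_embeds, doc_vec):
    """Gradient x input attribution of the query-document dot product with
    respect to each query token embedding.

    query_emb_fn:       hypothetical callable mapping (1, seq, d) token
                        embeddings to a pooled query vector of shape (d,)
    query_token_embeds: (1, seq, d) input token embeddings
    doc_vec:            (d,) precomputed document embedding
    """
    embeds = query_token_embeds.clone().requires_grad_(True)
    score = query_emb_fn(embeds) @ doc_vec          # retrieval score
    score.backward()
    # Per-token relevance: projection of the gradient onto the embedding.
    return (embeds.grad * embeds).sum(dim=-1).squeeze(0)   # (seq_len,)

# Tokens with low or negative attribution would then be flagged in the prompt
# given to the rewriting LLM.
```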
The paper introduces Neural Additive Experts (NAEs), a mixture-of-experts framework that learns specialized networks per feature and uses a dynamic gating mechanism to integrate information across features, relaxing the strict additivity of standard GAMs. By employing targeted regularization techniques to reduce variance among expert predictions, NAEs enable a smooth transition from additive models to those capturing feature interactions. Experiments on synthetic and real-world datasets demonstrate that NAEs achieve a better balance between predictive accuracy and feature-level interpretability compared to standard GAMs.
Introduces Neural Additive Experts (NAEs), a novel architecture that balances predictive accuracy and feature-level interpretability by using a mixture-of-experts framework with dynamic gating and targeted regularization to control the degree of model additivity.
This paper investigates the applicability of attribution-based explainability methods, commonly used for static classification tasks, to agentic AI systems where behavior emerges over multi-step trajectories. The authors compare attribution-based explanations with trace-based diagnostics in both static classification and agentic benchmarks (TAU-bench Airline and AssistantBench). They find that attribution methods, while stable in static settings, are unreliable for diagnosing execution-level failures in agentic trajectories, whereas trace-grounded rubric evaluation effectively localizes behavior breakdowns.
Demonstrates the limitations of applying attribution-based explainability methods designed for static predictions to agentic AI systems and advocates for trajectory-level explainability.
This paper identifies three key dimensions of safety for foundation model (FM)-enabled robots: action, decision, and human-centered safety, arguing that existing methods are insufficient for open-ended real-world scenarios. To address this, they propose a modular safety guardrail architecture with monitoring and intervention layers to ensure comprehensive safety across the autonomy stack. The paper further suggests cross-layer co-design strategies, such as representation alignment and conservatism allocation, to improve the speed and effectiveness of safety enforcement.
Proposes a modular safety guardrail architecture, composed of monitoring and intervention layers, to address the multifaceted safety challenges of deploying foundation model-enabled robots in real-world environments.
The paper investigates the role of individual layers in Vision-Language Models (VLMs) and discovers the existence of Task-Interfering Layers (TILs) that hinder downstream task performance. They quantify the effect of intervening on each layer using a Task-Layer Interaction Vector and observe task-specific sensitivity patterns. Based on these findings, they propose TaLo, a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer, achieving significant performance improvements on various tasks and models.
Discovers and characterizes Task-Interfering Layers in VLMs, demonstrating that bypassing these layers at inference time can improve performance without retraining.
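Bypassing a single block at inference can be done by swapping it for an identity wrapper, as in the sketch below; the attribute path into the model and the tuple return convention are assumptions about a Hugging Face-style layer stack, not TaLo's code.

```python
import torch.nn as nn

class SkipBlock(nn.Module):
    """Identity wrapper that bypasses a transformer block while keeping the
    original module around so it can be restored."""
    def __init__(self, block):
        super().__init__()
        self.block = block
    def forward(self, hidden_states, *args, **kwargs):
        return (hidden_states,)     # assumes the block normally returns a tuple

# Hypothetical usage; `model.encoder.layers` and the chosen index are placeholders:
# idx = most_interfering_layer      # picked per task from the interaction vector
# model.encoder.layers[idx] = SkipBlock(model.encoder.layers[idx])
```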
This paper introduces a physics-inspired deep reinforcement learning (DRL) framework for robotic motion planning that leverages Coulomb forces to model interactions between the robot, goal, and obstacles. The approach incorporates these forces into the reward function, providing attractive and repulsive signals, and further enhances collision avoidance using anticipatory rewards derived from LiDAR segmentation of obstacle boundaries. Experiments in both Gazebo simulations and real-world TurtleBot v3 deployments demonstrate that the proposed method reduces collisions and generates safer trajectories.
Introduces a novel physics-inspired reward function for DRL-based robotic motion planning using Coulomb forces and LiDAR-based anticipatory rewards to improve safety and explainability.
This paper investigates the impact of different explanation styles in AI-driven security dashboards on user trust, decision accuracy, and cognitive load. The authors conducted a mixed-methods study with security practitioners, comparing natural language rationales, confidence visualizations, counterfactual explanations, and hybrid approaches. Results demonstrate that explanation style significantly affects user trust calibration, decision accuracy, and cognitive load, leading to design guidelines for integrating explainability into enterprise UIs.
Empirically demonstrates the impact of various explanation styles on security analysts' trust, decision-making, and cognitive load within AI-enhanced UI security interfaces.
This paper introduces a jailbreak framework for vision-language models (VLMs) that combines Chain-of-Thought (CoT) prompting with a ReAct-driven adaptive noising mechanism to bypass safety filters. The adaptive noising iteratively perturbs input images based on model feedback, focusing on regions that trigger safety defenses. Experiments show that this dual-strategy significantly improves attack success rates (ASR) while preserving the naturalness of both text and visual inputs.
Introduces a novel jailbreak framework combining CoT prompting with ReAct-driven adaptive image noising to effectively bypass VLM safety filters.
This paper addresses the problem of spurious correlations in reward models used in Reinforcement Learning from Human Feedback (RLHF) by proposing a factored representation learning framework. The framework decomposes contextual embeddings into causal factors sufficient for reward prediction and non-causal factors capturing reward-irrelevant attributes, constraining the reward head to depend only on the causal component. Experiments on mathematical and dialogue tasks demonstrate improved robustness and downstream RLHF performance compared to baselines, with analyses showing mitigation of reward hacking behaviors like exploiting length and sycophantic bias.
Introduces a factored representation learning framework that decomposes contextual embeddings into causal and non-causal factors to improve the robustness of reward models in RLHF.
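An illustrative version of the decomposition is sketched below: the pooled embedding is split into a causal and a non-causal factor, the reward head reads only the causal part, and a cross-covariance penalty discourages leakage between the two. The penalty and dimensions are assumptions; the paper's exact constraints may differ.

```python
import torch
import torch.nn as nn

class FactoredRewardHead(nn.Module):
    """Scores rewards from a 'causal' factor only, with a decorrelation term
    between the causal and non-causal factors (illustrative, not the paper's
    exact objective)."""
    def __init__(self, d_model, d_factor=128):
        super().__init__()
        self.to_causal = nn.Linear(d_model, d_factor)
        self.to_spurious = nn.Linear(d_model, d_factor)
        self.reward = nn.Linear(d_factor, 1)

    def forward(self, h):                           # h: (batch, d_model)
        zc, zs = self.to_causal(h), self.to_spurious(h)
        r = self.reward(zc).squeeze(-1)
        # Penalize cross-covariance between the two factors (batch statistics).
        zc_c = zc - zc.mean(0, keepdim=True)
        zs_c = zs - zs.mean(0, keepdim=True)
        decorr = (zc_c.T @ zs_c / h.size(0)).pow(2).mean()
        return r, decorr

# Training would combine the usual Bradley-Terry preference loss on `r`
# with a small weight on `decorr`.
```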
The authors created REVEAL-CXR, a benchmark dataset of 200 chest radiographs with 12 labels for cardiothoracic disease, to evaluate multimodal large language models (LLMs) in radiology. They used GPT-4o and Phi-4-Reasoning to extract and map findings from 13,735 chest radiograph reports, then sampled 1,000 studies for expert radiologist review. The final dataset of 200 radiographs, verified by three radiologists, is publicly available for benchmarking and includes a holdout set for independent model evaluation by RSNA.
Introduces REVEAL-CXR, a high-quality, expert-validated benchmark dataset for chest radiograph interpretation, designed to facilitate the development and evaluation of clinically useful multimodal LLMs in radiology.
The paper introduces Heart2Mind, a Contestable AI (CAI) system for psychiatric disorder prediction using wearable ECG data, designed to allow clinicians to inspect and revise algorithmic outputs. The system employs a Multi-Scale Temporal-Frequency Transformer (MSTFT) to analyze R-R intervals from ECG sensors, combining time and frequency domain features. Results on the HRV-ACC dataset show MSTFT achieves 91.7% accuracy, and human-centered evaluation demonstrates that experts and the CAI system can effectively collaborate to confirm correct decisions and correct errors through dialogue.
Introduces a contestable AI system, Heart2Mind, that integrates a multi-scale temporal-frequency transformer with self-adversarial explanations and a collaborative chatbot to enable clinicians to scrutinize and refine psychiatric disorder predictions based on wearable ECG data.
This paper introduces Multimodal Generative Engine Optimization (MGEO), a novel adversarial attack framework that exploits vulnerabilities in VLM-based product search ranking systems. MGEO jointly optimizes imperceptible image perturbations and fluent textual suffixes to unfairly promote a target product, leveraging the cross-modal coupling within VLMs. Experiments on real-world datasets demonstrate that MGEO significantly outperforms unimodal attacks, highlighting the vulnerability of VLMs to coordinated multimodal manipulation.
Reveals a critical vulnerability in VLM-based ranking systems by demonstrating a coordinated multimodal attack that significantly outperforms unimodal attacks.
This paper addresses the challenge of integrating Explainable AI (XAI) into chest radiology by developing two deep learning-based XAI systems for pneumonia and COVID-19 detection using Grad-CAM and LIME. They introduce a multi-phase Human-Centered Design (HCD) methodology involving radiologists and clinicians in co-design and iterative prototyping to create a usable XAI interface. The study found that radiologists preferred combined original and AI-annotated images with adjustable overlays and tailored explanatory text, and that confidence scores aligned with clinical reasoning enhance trust and adoption.
Introduces a multi-phase Human-Centered Design (HCD) methodology for XAI in chest radiology, emphasizing participatory co-design and iterative prototyping with radiologists and clinicians.
This paper investigates visual-text fusion in MLLMs through layer-wise masking and attention analysis, revealing non-uniform fusion across layers and a late-stage visual signal reactivation. The authors identify persistent high-attention noise on irrelevant regions and increasing attention on text-aligned areas during processing. Based on these insights, they propose a training-free contrastive attention framework that models attention shifts between early fusion and final layers, enhancing multimodal reasoning.
Introduces a training-free contrastive attention framework that models attention shifts between early fusion and final layers to improve multimodal reasoning in MLLMs.
The paper introduces SCALPEL, a framework for selectively ablating capabilities in LLMs by representing them as low-rank parameter subspaces and using LoRA adapters to reduce the model's ability to distinguish correct from incorrect answers on specific tasks. This approach allows for fine-grained capability removal without affecting other capabilities, addressing the limitations of coarse-grained methods that assume direct mapping between capabilities and modules. Experiments on diverse tasks demonstrate SCALPEL's effectiveness in removing target capabilities while preserving general language modeling abilities, revealing the low-rank structure of capabilities and enabling targeted parameter-space interventions.
Introduces SCALPEL, a novel method for selectively ablating capabilities in LLMs by identifying and modifying low-rank parameter subspaces associated with those capabilities.
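A rough sketch of what a capability-removal objective of this kind could look like is given below: drive the model's margin between correct and incorrect answers on the target task toward zero while a KL term to the frozen base model preserves general behavior, training only low-rank adapter parameters. The squared-margin form and the loss weighting are assumptions, not SCALPEL's published objective.

```python
import torch

def ablation_objective(logp_correct, logp_incorrect, kl_to_base, beta=0.1):
    """Sketch of a selective-ablation loss over a batch of task items.

    logp_correct / logp_incorrect: per-example log-likelihoods of the correct
        and incorrect answers under the adapted model
    kl_to_base: per-example KL penalty to the frozen base model on held-out text
    """
    margin = logp_correct - logp_incorrect          # per-example discrimination
    forget = margin.pow(2).mean()                   # push discrimination to zero
    return forget + beta * kl_to_base.mean()
```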
This study benchmarked ChatGPT-5, Claude AI (Sonnet 4.0), and Perplexity (Mistral Large 2) on their ability to answer dental trauma questions, assessing accuracy, consistency, readability, and information quality. Perplexity demonstrated the highest accuracy on true/false questions, while ChatGPT excelled in readability, Perplexity in understandability and actionability, and Claude in information reliability for open-ended questions. The results suggest that LLM-based chatbots can play a complementary role in dental trauma management, with tool selection dependent on the specific application.
Quantifies the performance of three prominent LLM-based chatbots across multiple dimensions relevant to dental trauma management, highlighting their strengths and weaknesses.
This paper replicates Anthropic's mechanistic interpretability work using sparse autoencoders (SAEs) on Llama 3.1 to extract and steer human-interpretable features, stress-testing the generalizability of these methods. The authors successfully reproduce basic feature extraction and steering, but find significant fragility in feature steering, sensitivity to various parameters, and difficulty in distinguishing thematically similar features. The study concludes that current SAE-based interpretability methods lack the systematic reliability needed for safety-critical applications, suggesting a shift towards prioritizing reliable model output prediction and control.
Demonstrates the fragility and limitations of current SAE-based mechanistic interpretability techniques for Llama 3.1, particularly regarding feature steering and thematic feature differentiation.
This paper introduces Hidden State Poisoning Attacks (HiSPAs) that exploit vulnerabilities in Mamba-based language models by overwriting information in their hidden states, leading to a partial amnesia effect. The authors evaluate the impact of HiSPAs using the RoBench25 benchmark, demonstrating the susceptibility of SSMs, including a 52B Jamba model, to these attacks, unlike pure Transformers. Furthermore, they show that HiSPA triggers weaken the Jamba model on the Open-Prompt-Injections benchmark and provide an interpretability analysis of Mamba's hidden layers during attacks.
Demonstrates the vulnerability of Mamba-based language models to Hidden State Poisoning Attacks (HiSPAs), which induce partial amnesia by overwriting information in hidden states.
This paper evaluates the clinical performance of five large language models (LLMs) in complex cardiac surgery scenarios using a blinded two-phase evaluation by senior surgeons. The study found that while a reasoning-optimized proprietary LLM (O1) performed best, all models exhibited deficits in patient safety, hallucination avoidance, and clinical efficiency. A key finding was the "overacceptance" failure mode, where clinicians initially failed to identify flawed model outputs, suggesting that over-reliance on LLMs could pose significant risks in clinical decision-making.
Reveals a critical human-AI collaboration failure mode of "overacceptance" in cardiac surgery, where clinicians initially miss flawed LLM outputs, highlighting potential risks beyond simple model inaccuracy.
This study evaluated the performance of four LLMs (ChatGPT-4o, DeepSeek-V2, Gemini, and Grok) in applying the 2019 European Society of Cardiology guidelines for pulmonary embolism (PE) using ten open-ended questions based on a simulated PE case. The LLMs were scored by emergency physicians based on clinical accuracy and adherence to guidelines, revealing that ChatGPT-4o achieved the highest overall score, but performance varied across different clinical domains. While the LLMs show promise, the study highlights the need for further development to improve clinical integration and guideline compliance.
Quantifies the performance of four prominent LLMs in the context of applying evidence-based guidelines for pulmonary embolism, revealing both strengths and weaknesses in their clinical reasoning and guideline adherence.
The paper introduces Patch-based Adversarial Noise Compression (PANC), a decision-based black-box adversarial attack method designed to efficiently attack Transformer-based visual trackers by exploiting patch-wise noise sensitivity. PANC uses a noise sensitivity matrix to dynamically adjust adversarial noise levels in different patches, optimizing noise distribution and reducing query counts. Experiments on OSTrack, STARK, TransT, and MixformerV2 across GOT-10k, TrackingNet, and LaSOT datasets demonstrate that PANC achieves a 162% improvement in attack effectiveness with only 45.7% of the queries compared to existing methods, while compressing noise to 10% of the original level.
Introduces a patch-based adversarial noise compression (PANC) method that significantly improves the efficiency and concealment of decision-based black-box adversarial attacks against Transformer-based visual trackers.
The paper introduces XIMED, a dual-loop evaluation framework for XAI methods in medical imaging, specifically chest X-ray classification. It evaluates LIME and SHAP explanations using both predictive model-centered metrics (sensitivity to model changes, feature identification) and human-centered metrics (trust, diagnostic agreement) with 97 medical experts. The study found that while SHAP significantly impacted diagnostic changes and both methods identified critical features, both LIME and SHAP negatively impacted contra-indicative agreement, with SHAP proving more effective in facilitating correct diagnostic changes when initial diagnoses were correct.
Introduces XIMED, a comprehensive dual-loop evaluation framework integrating predictive model-centered and human-centered evaluations for assessing XAI methods in medical imaging.
This paper introduces a novel causal attribution framework for supply chain simulations that combines Shapley values with Gaussian process emulators to decompose simulation outputs into individual input effects. The approach addresses the challenge of explaining complex simulation outputs by quantifying the contribution of each input feature. Experiments on synthetic and real-world supply chain data demonstrate the framework's ability to efficiently identify root causes of anomalies.
Introduces a Shapley value-based causal attribution framework integrated with Gaussian process models to explain and decompose complex supply chain simulation outputs.
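The basic recipe can be sketched with scikit-learn: fit a Gaussian process emulator to logged simulator runs, then estimate Shapley values on the cheap emulator by Monte Carlo permutation sampling. The kernel choice, the `X_sim`/`y_sim` placeholders, and the sampling scheme are assumptions for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def mc_shapley(f, x, background, n_perm=200, rng=None):
    """Monte Carlo permutation estimate of Shapley values of f at point x,
    using rows of `background` to stand in for 'absent' features."""
    rng = rng or np.random.default_rng(0)
    d = x.shape[0]
    phi = np.zeros(d)
    for _ in range(n_perm):
        order = rng.permutation(d)
        z = background[rng.integers(len(background))].copy()
        prev = f(z[None])[0]
        for j in order:
            z[j] = x[j]                     # reveal feature j
            cur = f(z[None])[0]
            phi[j] += cur - prev
            prev = cur
    return phi / n_perm

# Emulate the expensive simulator with a GP, then attribute its output.
# `X_sim`, `y_sim` are logged simulator inputs/outputs (placeholders here).
# gp = GaussianProcessRegressor(ConstantKernel() * RBF()).fit(X_sim, y_sim)
# phi = mc_shapley(gp.predict, X_sim[0], X_sim)
```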
This paper introduces a multimodal RGB-D gait recognition framework for Parkinson's disease (PD) detection that enhances accuracy and interpretability. The framework uses dual YOLOv11 encoders for feature extraction from RGB and Depth data, followed by a Multi-Scale Local-Global Extraction (MLGE) module and Cross-Spatial Neck Fusion to improve spatial-temporal representation. A frozen Large Language Model (LLM) then translates fused visual embeddings and metadata into clinically relevant textual explanations, improving transparency and bridging the gap between visual recognition and clinical understanding.
Introduces a novel vision-language framework for Parkinson's disease gait analysis that fuses multimodal RGB-D data with a frozen LLM to generate clinically meaningful textual explanations, enhancing both accuracy and interpretability.
This paper adapts mechanistic interpretability techniques, including activation patching, attention saliency, and sparse autoencoders, to transformer-based time series classification models. The study aims to uncover the internal decision-making processes of these models, which are often obscured by their complexity. The experiments on benchmark datasets reveal causal graphs illustrating information flow, key attention heads, and temporal positions that drive correct classifications, along with interpretable latent features discovered via sparse autoencoders.
Demonstrates the applicability of mechanistic interpretability techniques, originally developed for NLP, to transformer-based time series classification, revealing internal causal structures and interpretable latent features.
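Activation patching itself reduces to a short hook-based routine, sketched below for a generic classifier: cache a layer's output on a clean input, replay it during the corrupted forward pass, and measure how much of the target logit is recovered. The assumption that the layer outputs a single tensor and that `model(x)` returns logits is a placeholder, not tied to the paper's models.

```python
import torch

def activation_patch_effect(model, layer, clean_x, corrupt_x, target_class):
    """Fraction of the clean-vs-corrupted logit gap restored by patching
    `layer`'s clean activation into the corrupted run (1.0 = full recovery)."""
    with torch.no_grad():
        corrupt_logit = model(corrupt_x)[..., target_class]

    cache = {}
    h = layer.register_forward_hook(lambda m, i, o: cache.update(act=o.detach()))
    with torch.no_grad():
        clean_logit = model(clean_x)[..., target_class]
    h.remove()

    # Returning a tensor from a forward hook replaces the module's output.
    h = layer.register_forward_hook(lambda m, i, o: cache["act"])
    with torch.no_grad():
        patched_logit = model(corrupt_x)[..., target_class]
    h.remove()

    return (patched_logit - corrupt_logit) / (clean_logit - corrupt_logit + 1e-9)
```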
The paper introduces a collaborative multimodal intelligence framework for plant disease diagnosis, integrating vision-language modeling, environmental sensing, and structured knowledge reasoning. This multi-agent system uses specialized agents for feature extraction, context-aware reasoning, and decision fusion to improve diagnostic accuracy in complex environments. The system achieves 54.5% overall diagnostic accuracy, outperforming single-modality baselines by 18.5%, and is deployed as a knowledge-distilled lightweight model within a WeChat Mini Program for real-time diagnosis.
Introduces a multi-agent architecture that integrates vision-language models, environmental sensors, and structured knowledge to improve plant disease diagnosis interpretability and accuracy.
This paper introduces a framework for explainable GenAI that combines cognitive-inspired interpretability, retrieval-augmented generation, and semantic attribution mapping to decode and visualize LLM reasoning. The framework translates latent reasoning traces into human-understandable graphical narratives across token, layer, and decision levels. Experiments on reasoning datasets demonstrate improvements in faithfulness, causal coherence, and user interpretability, advancing ethically aligned and auditable AI systems.
Introduces a novel visualization engine that translates latent LLM reasoning traces into human-understandable graphical narratives, enhancing transparency across different levels of abstraction.
The paper introduces ECG-XPLAIM, a deep learning model for ECG classification that uses a 1D inception-style CNN to capture local waveform features and global rhythm patterns, enhanced with Grad-CAM for interpretability. Trained on MIMIC-IV and validated on PTB-XL, ECG-XPLAIM achieved high diagnostic performance (AUROC > 0.9) for multiple arrhythmias, demonstrating superior performance compared to baseline models and improved sensitivity over a ResNet model. The model's interpretability, achieved through Grad-CAM highlighting relevant ECG segments, addresses a key limitation of AI in clinical ECG analysis.
Introduces a novel, interpretable deep learning architecture, ECG-XPLAIM, for arrhythmia detection that combines high accuracy with explainability through Grad-CAM visualization of relevant ECG segments.
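Grad-CAM over a 1D feature map follows the standard recipe, sketched below; the `conv_layer` handle and the single-channel input shape are placeholder assumptions rather than ECG-XPLAIM's code.

```python
import torch
import torch.nn.functional as F

def grad_cam_1d(model, conv_layer, signal, class_idx):
    """Grad-CAM for a 1D CNN: weight each channel of the last conv feature map
    by the gradient of the target-class score, combine, ReLU, and upsample to
    the input length. `signal` is assumed to be (1, channels, length)."""
    feats, grads = {}, {}
    h1 = conv_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    logits = model(signal)                    # (1, n_classes)
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()

    a, g = feats["a"], grads["g"]             # both (1, C, T')
    weights = g.mean(dim=-1, keepdim=True)    # per-channel importance
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))        # (1, 1, T')
    cam = F.interpolate(cam, size=signal.shape[-1], mode="linear",
                        align_corners=False)
    return (cam / (cam.max() + 1e-9)).squeeze()                 # (T,) in [0, 1]
```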
This paper investigates the evolution of vocabulary embedding geometry in LLMs during training by correlating input and output embeddings of Pythia 12B and OLMo 7B with semantic, syntactic, and frequency-based metrics using representational similarity analysis. The study reveals that vocabulary embedding geometry rapidly aligns with semantic and syntactic features early in training. Furthermore, high-frequency and function words converge faster than low-frequency words, which retain initial bias.
Demonstrates that linguistic structure emerges rapidly in vocabulary embeddings during LLM training, with distinct convergence rates based on word frequency and function.
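The core RSA computation is compact, as sketched below: compare the pairwise-dissimilarity structure of the embedding matrix with that of a per-word metric via rank correlation. The cosine/Euclidean distance choices are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(embeddings, feature_values):
    """Representational similarity analysis between an embedding matrix
    (n_words, d) and a per-word scalar feature (e.g. log frequency):
    Spearman correlation of the two pairwise-dissimilarity vectors."""
    emb_rdm = pdist(embeddings, metric="cosine")                  # embedding geometry
    feat_rdm = pdist(np.asarray(feature_values).reshape(-1, 1))   # metric geometry
    rho, _ = spearmanr(emb_rdm, feat_rdm)
    return rho

# Tracking rsa_score across training checkpoints shows how quickly the
# vocabulary geometry aligns with each linguistic metric.
```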
The paper addresses the problem of LLMs failing in real-world tool interactions due to incorrect parameterization, poor tool selection, or misinterpretation of user intent by introducing a curriculum-inspired framework that uses structured reasoning templates. This framework guides LLMs through step-by-step instructions for generating function calls, improving their understanding of user goals and tool documentation. Experiments demonstrate a 3-12% relative improvement in tool-use accuracy compared to strong baselines, while also enhancing robustness, interpretability, and transparency.
Introduces a curriculum-inspired framework with structured reasoning templates to guide LLMs in generating function calls, thereby improving tool-use accuracy and interpretability.
This paper investigates the faithfulness of chain-of-thought (CoT) reasoning in LLMs by extracting and manipulating monosemantic features using sparse autoencoders and activation patching. By swapping CoT-reasoning features into noCoT runs on GSM8K, the authors demonstrate a significant increase in answer log-probabilities in Pythia-2.8B, but not in Pythia-70M, indicating a scale-dependent effect. The study also reveals that CoT leads to higher activation sparsity and feature interpretability in the larger model, suggesting more modular internal computation.
Demonstrates that chain-of-thought prompting induces more interpretable and modular internal structures in larger LLMs, as evidenced by feature-level causal interventions.
This paper generalizes the connection between Direct Preference Optimization (DPO) and human choice theory, extending the normative framework underlying DPO. By reworking the standard human choice theory, the authors demonstrate that any compliant machine learning analytical choice model can be embedded within any human choice model. This generalization supports non-convex losses and provides a unifying framework for various DPO extensions like margins and length correction.
Establishes a generalized normative framework connecting DPO with human choice theory, demonstrating broader applicability and theoretical underpinnings for preference optimization.
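For reference, the standard DPO objective that this framework generalizes is the logistic (Bradley-Terry) choice model over a reference policy; the paper's extensions, such as margins and length correction, amount to modifying the term inside the link function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
     \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
     \right)\right]
```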

