April 19 – April 26, 2026

Interpretability & Mechanistic Interp - Weekly Roundup

88 papers published across 5 labs.

330% acceleration

Selected Labs publishing this week

Microsoft Research (2), Amazon Science (1), Tsinghua AI (1), MIT CSAIL (1), ETH (1)

Top Papers

Apr 25, 2026

Chathurangi Shyalika +2 · Apr 25, 2026

IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance

Neurosymbolic grounding of LLMs in telemetry and knowledge graphs slashes expert-rated overclaims in industrial maintenance explanations by 93%, making AI assistants far more trustworthy in safety-critical settings.

Chathurangi Shyalika, Dhaval Patel, Amit P. Sheth

Interpretability & Mechanistic Interp Natural Language Processing Robotics & Embodied AI +1

Apr 23, 2026

Vipula Rawte +3 · Apr 23, 2026 · also Adobe Research

Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models

LLMs can be made 20% more accurate by jointly attributing claims to sources and verifying them, rather than just verifying.

Vipula Rawte, Ryan A. Rossi, Franck Dernoncourt +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp +1

Florian Holeczek +4 · Apr 23, 2026

GFlowState: Visualizing the Training of Generative Flow Networks Beyond the Reward

Uncover hidden GFlowNet training dynamics with GFlowState, a visual analytics tool that reveals how these models explore the sample space and shift sampling probabilities.

Florian Holeczek, A. Hinterreiter, A. Hernandez-Garcia +2

Interpretability & Mechanistic Interp Scientific Discovery & Drug Design Training Efficiency & Optimization

Kaitlin Gili +3 · Apr 23, 2026

Locating acts of mechanistic reasoning in student team conversations with mechanistic machine learning

Inductive biases make machine learning models better at spotting mechanistic reasoning in student discussions, even when those students are tackling new problems.

Kaitlin Gili, Mainak Nistala, Kristen Wendell +1

Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought

Timothy Murphy +2 · Apr 23, 2026

Interpretable facial dynamics as behavioral and perceptual traces of deepfakes

Deepfakes betray themselves through subtle irregularities in the timing of facial movements, especially when expressing emotions, offering a new avenue for detection.

Timothy Murphy, J. Cook, H. Cuve

Computer Vision Interpretability & Mechanistic Interp

All Papers (88)

Apr 23, 2026

Lynn Vonderhaar +3 · Apr 23, 2026

Verifying Machine Learning Interpretability Requirements through Provenance

Quantifiable functional requirements derived from ML provenance can bridge the gap between abstract interpretability goals and verifiable model behavior.

Lynn Vonderhaar, J. Couder, Daryela Cisneros +1

Interpretability & Mechanistic Interp

Isabel Kurth +2 · Apr 23, 2026

Evaluating Post-hoc Explanations of the Transformer-based Genome Language Model DNABERT-2

Despite their architectural differences, Transformer-based genome language models can provide biological insights as reliable as those from CNNs, as revealed by attention-based explainability methods.

Isabel Kurth, Paulo Yanez Sarmiento, Bernhard Y. Renard

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp Scientific Discovery & Drug Design

Yilang Liu +4 · Apr 23, 2026

Task-specific Subnetwork Discovery in Reinforcement Learning for Autonomous Underwater Navigation

Multi-task RL agents solving related navigation tasks underwater rely on a surprisingly small fraction of their weights (1.5%) to differentiate between tasks.

Yilang Liu, Melvin Laux, M. D. L. Álvarez +2

Interpretability & Mechanistic Interp RLHF & Preference Learning Robotics & Embodied AI

Vishal Rajput · Apr 23, 2026

Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair

Supervised learning is fundamentally flawed: models *must* retain sensitivity to irrelevant features, opening the door to adversarial attacks and other vulnerabilities.

Vishal Rajput

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp Scalable Oversight & Alignment Theory

Jon-Paul Cacioli · Apr 23, 2026

Cross-Entropy Is Load-Bearing: A Pre-Registered Scope Test of the K-Way Energy Probe on Bidirectional Predictive Coding

Cross-entropy loss isn't just a detail – it's the unsung hero behind how well energy probes work in predictive coding networks, accounting for up to 66% of the probe-softmax gap.

Jon-Paul Cacioli

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp

Pedro Seber +1 · Apr 23, 2026

Improving Performance in Classification Tasks with LCEN and the Weighted Focal Differentiable MCC Loss

Forget cross-entropy: a differentiable MCC loss function can boost classification performance by nearly 5% in F1 score and 8.5% in MCC.

Pedro Seber, Richard D. Braatz

Interpretability & Mechanistic Interp Training Efficiency & Optimization

O. O. Sarumi +2 · Apr 23, 2026

Fine-Grained Perspectives: Modeling Explanations with Annotator-Specific Rationales

Modeling annotator-specific explanations substantially boosts NLI prediction accuracy and provides a richer understanding of disagreement compared to simply conditioning on annotator identity.

O. O. Sarumi, Charles Welch, Daniel Braun

Interpretability & Mechanistic Interp Natural Language Processing

Michael Bouzinier +4 · Apr 23, 2026

Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages

Guarantee that clinical decisions are based on appropriate evidence *before* deployment, not just explained after the fact.

Michael Bouzinier, S. Trifonov, Michael Chumack +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp

L. Çağlar +2 · Apr 23, 2026

Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision

Despite achieving comparable accuracy, humans and deep vision models exhibit fundamentally different error patterns, revealing distinct inductive biases that can be quantified through directional confusion analysis and Rate-Distortion geometry.

L. Çağlar, Pedro Mediano, Baihan Lin

Computer Vision Interpretability & Mechanistic Interp

Frederik L. Dennig +1 · Apr 23, 2026

Local Neighborhood Instability in Parametric Projections: Quantitative and Visual Analysis

Parametric projections, like UMAP and t-SNE, can have surprisingly unstable local neighborhoods, leading to unpredictable shifts in the 2D layout even with small input variations.

Frederik L. Dennig, Daniel A. Keim

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp

C. Mbonu +3 · Apr 23, 2026

An interpretable vision transformer framework for automated brain tumor classification

Achieve near-perfect brain tumor classification with a Vision Transformer, unlocking clinically interpretable insights via attention rollouts.

C. Mbonu, T. Belonwu, Okwuchukwu Ejike Chukwuogo +1

Architecture Design (Transformers, SSMs, MoE) Computer Vision Interpretability & Mechanistic Interp

Technische Hochschule Nürnberg Georg Simon Ohm · Apr 23, 2026

Time vs. Layer: Locating Predictive Cues for Dysarthric Speech Descriptors in Wav2vec 2.0

Turns out where you look in Wav2vec 2.0's representations *really* matters: intelligibility lives in the layers, while articulation problems hide in the time steps.

Natalie Engert, Dominik Wagner, K. Riedhammer +1

Interpretability & Mechanistic Interp Natural Language Processing Speech & Audio

Apr 23, 2026 · also Sinequa by ChapsVision

From Tokens to Concepts: Leveraging SAE for SPLADE

SPLADE models can ditch their token-based vocabularies for a latent semantic space learned by Sparse Auto-Encoders, maintaining retrieval performance while boosting efficiency.

Yuxuan Zong, Mathias Vast, Basile Van Cooten +2

Interpretability & Mechanistic Interp Natural Language Processing Recommendation & Information Retrieval

Hieu Man +5 · Apr 23, 2026

Explainable Disentangled Representation Learning for Generalizable Authorship Attribution in the Era of Generative AI

Achieve state-of-the-art authorship attribution and few-shot AI-generated text detection by explicitly disentangling style and content with a novel explainable VAE architecture.

Hieu Man, Van-Cuong Pham, Nghia Trung Ngo +3

Interpretability & Mechanistic Interp Natural Language Processing

Apr 22, 2026

Kyushu University · Apr 22, 2026

QuanForge: A Mutation Testing Framework for Quantum Neural Networks

QuanForge reveals that targeted mutation testing can significantly enhance the reliability of Quantum Neural Networks by pinpointing their vulnerabilities.

Minqi Shao, Shangzhou Xia, Jianjun Zhao

Interpretability & Mechanistic Interp

Apr 22, 2026 · also NeuroSpin

Improving clinical interpretability of linear neuroimaging models through feature whitening

Whitening neuroimaging features can transform linear models from black boxes into interpretable tools for understanding brain pathology.

Sara Petiton, Antoine Grigis, Raphaël Vock +1

Interpretability & Mechanistic Interp Scientific Discovery & Drug Design

Meteorological Institute Munich · Apr 22, 2026 · also Deutsches Zentrum für Luft- und Raumfahrt, Ludwig-Maximilians-Universität

Mechanistic Interpretability Tool for AI Weather Models

Unlock the secrets of AI weather models: a new tool reveals how latent representations encode interpretable meteorological features.

Kirsten I. Tempest, Matthias Beylich, George C. Craig

Interpretability & Mechanistic Interp Scientific Discovery & Drug Design

Changho Han +31 · Apr 22, 2026

Surrogate modeling for interpreting black-box LLMs in medical predictions

LLMs may encode dangerous biases and inaccuracies, revealing a critical need for interpretability in medical applications.

Changho Han, Songsoo Kim, Dong Won Kim +29

Interpretability & Mechanistic Interp

Rickmer Schulte +1 · Apr 22, 2026

Rethinking Intrinsic Dimension Estimation in Neural Representations

Common methods for estimating the complexity of neural network representations are fundamentally flawed, potentially invalidating a large body of prior work.

Rickmer Schulte, David Rugamer

Interpretability & Mechanistic Interp

Weizhi Nie +1 · Apr 22, 2026

Causal-Transformer with Adaptive Mutation-Locking for Early Prediction of Acute Kidney Injury

Finally, a deep learning model for AKI prediction that doesn't just predict, but tells you *why*, by tracing the causal chain of physiological events.

Weizhi Nie, Haolin Chen

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp Scientific Discovery & Drug Design

Apr 22, 2026

Diagnosing CFG Interpretation in LLMs

LLMs maintain surface syntax but collapse on structural semantics, revealing critical gaps in their ability to function as reliable agents in complex environments.

Hanqi Li, Lu Chen

Interpretability & Mechanistic Interp Tool Use & Agents

Yuhang Wu +3 · Apr 22, 2026

LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures

Forget hand-tuning layer configurations: LayerTracer reveals the precise layers where LLMs learn and break, paving the way for automated architecture optimization.

Yuhang Wu, Qinyuan Liu, Qiuyang Zhao +1

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

Apr 22, 2026 · also Microsoft Research, California State Polytechnic University

Auditing and Controlling AI Agent Actions in Spreadsheets

Users who actively participate in an AI agent's spreadsheet execution not only improve task outcomes but also gain a deeper understanding of, and feel more ownership over, the results.

Sadra Sabouri, Zeinabsadat Saghi, Run Huang +4

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp Tool Use & Agents

Xuelin Zhang +3 · Apr 22, 2026 · also HKUST

Meta Additive Model: Interpretable Sparse Learning With Auto Weighting

Forget hand-tuning loss functions: this meta-learning approach automatically learns optimal sample reweighting for sparse additive models, boosting robustness and accuracy.

Xuelin Zhang, Xinyue Liu, Lingjuan Wu +1

Interpretability & Mechanistic Interp Training Efficiency & Optimization

Apr 22, 2026 · also UCSD

Convergent Evolution: How Different Language Models Learn Similar Number Representations

Despite architectural differences, language models exhibit convergent evolution by learning similar periodic features for number representation, but achieving geometric separability depends on subtle training factors.

Deqing Fu, Tianyi Zhou, Mikhail Belkin +3

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp Natural Language Processing +1

Apr 21, 2026

University of Catania · Apr 21, 2026 · also Polish Academy of Sciences, Poznan University of Technology

PREF-XAI: Preference-Based Personalized Rule Explanations of Black-Box Machine Learning Models

Stop guessing what explanations users want: PREF-XAI learns personalized explanations by directly modeling user preferences over rule-based explanations.

Salvatore Greco, Jacek Karolczak, Roman Słowiński +1

Interpretability & Mechanistic Interp RLHF & Preference Learning

Apr 21, 2026 · also Archimedes/Athena Research Center

TACENR: Task-Agnostic Contrastive Explanations for Node Representations

Node embeddings aren't just about node attributes: proximity and structural features play a surprisingly large role in shaping them.

Vasiliki Papanikou, Evaggelia Pitoura

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp

Apr 21, 2026

FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition

Uncover hidden performance disparities in your ML models with FairTree, a new auditing tool that pinpoints fairness issues across continuous, categorical, and ordinal features while dissecting bias and variance contributions.

Rudolf Debelak

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp

Gonzalo Nápoles +2 · Apr 21, 2026 · also Universidad de Talca

Concept Inconsistency in Dermoscopic Concept Bottleneck Models: A Rough-Set Analysis of the Derm7pt Dataset

A surprising 30% of images in the Derm7pt dermoscopy dataset have conflicting concept profiles, imposing a hard accuracy ceiling of 92.1% on Concept Bottleneck Models.

Gonzalo Nápoles, Isel Grau, Yamisleydi Salgueiro

Computer Vision Interpretability & Mechanistic Interp

Manav Pandey · Apr 21, 2026

LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit

LLMs aren't just wrong sometimes, they *know* they're wrong and agree with you anyway, thanks to a surprisingly compact "sycophancy-lying circuit" that evades current alignment techniques.

Manav Pandey

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp RLHF & Preference Learning

Julian Skifstad +2 · Apr 21, 2026

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

LLMs are surprisingly linear, enabling precise, closed-loop control of behavior via model-based linear optimal control of activations.

Julian Skifstad, Xinyue Annie Yang, Glen Chou

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp Natural Language Processing

Haoyang Chen +5 · Apr 21, 2026 · also Monash, Quantstamp

How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning

LLMs signal their internal certainty during answer decoding through predictable attention patterns on their own reasoning traces.

Haoyang Chen, Yi Liu, Jianzhi Shao +3

Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought

Nurkhan Laiyk +4 · Apr 21, 2026

Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation

English-to-X translation skills can be distilled into function vectors that generalize to Y, Z, and other languages, suggesting a shared underlying translation mechanism in multilingual LLMs.

Nurkhan Laiyk, Gerard I. Gállego +2

Interpretability & Mechanistic Interp Natural Language Processing

HSE University · Apr 21, 2026

Rank-Turbulence Delta and Interpretable Approaches to Stylometric Delta Metrics

Unlocking authorship attribution: Rank-Turbulence and Jensen-Shannon Delta offer interpretable and effective alternatives to traditional methods, enhancing close reading and validation of results.

Dmitry Pronin, D.D. Pronin, Evgeny Kazartsev

Interpretability & Mechanistic Interp Natural Language Processing

François Remy +1 · Apr 21, 2026

Diagnosable ColBERT: Debugging Late-Interaction Retrieval Models Using a Learned Latent Space as Reference

Unlock the black box of late-interaction retrieval models: Diagnosable ColBERT lets you directly inspect what the model "understands" by aligning token embeddings to a clinically-grounded latent space.

François Remy

Interpretability & Mechanistic Interp Natural Language Processing Recommendation & Information Retrieval

Qin Dai +2 · Apr 21, 2026

Cell-Based Representation of Relational Binding in Language Models

LLMs use a surprisingly structured "Cell-based Binding Representation" to track entities and relations in discourse, opening the door to targeted interventions and improved relational reasoning.

Qin Dai, Benjamin Heinzerling, Kentaro Inui

Interpretability & Mechanistic Interp Natural Language Processing Reasoning & Chain-of-Thought

(Corresponding authors: Bingguo Liu) · Apr 21, 2026

When Can We Trust Deep Neural Networks? Towards Reliable Industrial Deployment with an Interpretability Guide

A simple difference in IoU scores between class-specific and class-agnostic heatmaps can reliably flag potentially erroneous predictions in industrial defect detection, even achieving 100% recall of false negatives with adversarial enhancement.

Hang-Cheng Dong, Yuhao Jiang, Yibo Jiao +9

Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

Univ Gustave Eiffel · Apr 21, 2026 · also Institut Polytechnique de Paris

Deep sprite-based image models: An analysis

Sprite-based image models, long overlooked, can now achieve state-of-the-art unsupervised segmentation with linear scaling, thanks to a deep learning approach.

Zeynep Sonat Baltacı, Romain Loiseau, Mathieu Aubry

Computer Vision Interpretability & Mechanistic Interp

Oleg Solozobov +1 · Apr 21, 2026

Governed Auditable Decisioning Under Uncertainty: Synthesis and Agentic Extension

Agentic AI systems introduce fundamental breaks in governance frameworks, making it difficult to reconstruct what happened or why decisions were made.

Oleg Solozobov

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp Tool Use & Agents

Yusuf Çelebi +11 · Apr 21, 2026

RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models

Forget heuristics: RDP LoRA leverages the hidden geometry of LLMs to pinpoint the most impactful layers for parameter-efficient fine-tuning, boosting performance while adapting fewer parameters.

Yusuf Çelebi, Yağız Asker +9

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp Natural Language Processing +1

Nicholas Popovič +1 · Apr 21, 2026

Tracing Relational Knowledge Recall in Large Language Models

Forget scaling laws: the secret to extracting relational knowledge from LLMs lies in the specificity and connectedness of the relations themselves, and how their signals are distributed across attention heads.

Nicholas Popovič, Michael Farber

Interpretability & Mechanistic Interp Natural Language Processing

Amazon Science · Apr 21, 2026

Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps

SpeechLLMs' hallucinations betray themselves in their attention patterns, offering a new way to detect these errors without needing expensive human-labeled data.

Jonas Waldendorf, Bashar Awwad Shiekh Hasan, Evgenii Tsymbalov

Interpretability & Mechanistic Interp Natural Language Processing Speech & Audio

Kun Wang +8 · Apr 21, 2026

ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

Projector fine-tuning, commonly used for aligning MLLMs, unexpectedly introduces backdoor vulnerabilities with activation mechanisms distinct from those in text-only LLMs.

Kun Wang, Cheng Qian +6

Interpretability & Mechanistic Interp Multimodal Models Red-Teaming & Adversarial Robustness

Hugo Lyons Keenan +2 · Apr 21, 2026

Mechanistic Anomaly Detection via Functional Attribution

Neural networks can be compromised even when their outputs appear correct; this new method spots the hidden anomalies by checking if a model's decisions can be explained by its past training.

Hugo Lyons Keenan, Christopher Leckie, Sarah M. Erfani

Interpretability & Mechanistic Interp

Minghua Zheng +4 · Apr 21, 2026

Investigation of cardinality classification for bacterial colony counting using explainable artificial intelligence

Turns out, the best colony counter struggles not because of the model, but because all those colonies look too darn similar.

Minghua Zheng, Na Helian, P. C. R. Lane +2

Computer Vision Interpretability & Mechanistic Interp Scientific Discovery & Drug Design

E. Knights · Apr 21, 2026

Cognitive Alignment At No Cost: Inducing Human Attention Biases For Interpretable Vision Transformers

Surprisingly, ViTs can be made more human-like in their attention patterns, for free, simply by fine-tuning on human eye-tracking data, without hurting accuracy.

E. Knights

Architecture Design (Transformers, SSMs, MoE) Computer Vision Interpretability & Mechanistic Interp

Palawat Busaranuvong +5 · Apr 21, 2026 · also Worcester Polytechnic Institute

Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

A 4B-parameter model can outperform GPT-5.1 in wound infection classification by distilling its reasoning and fine-tuning with reinforcement learning, offering a path to more efficient and interpretable medical image analysis.

Palawat Busaranuvong, Reza Saadati Fard, Emmanuel Agu +3

Computer Vision Interpretability & Mechanistic Interp Multimodal Models

VNU University of Engineering and Technology · Apr 21, 2026 · also TU Delft

From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems

Counterfactual explainers for recommender systems don't generalize as well as we thought: their effectiveness and sparsity depend heavily on the evaluation setting, and graph-based methods struggle to scale.

Quang-Huy Nguyen, Thanh-Hai Nguyen, Khac-Manh Thai +6

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Recommendation & Information Retrieval

Apr 21, 2026

Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

LLMs have "pure incorrectness" features that correlate with wrong answers but don't actually *cause* them, suggesting that simply identifying error-correlated activations isn't enough for effective intervention.

Het Patel, Tiejin Chen, Hua Wei +1

Interpretability & Mechanistic Interp

Apr 21, 2026 · also Adelaide University, RAE Decode

TriEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs

LLM agents often say one thing, believe another, and do something completely different, especially when interacting with other agents.

Ziyi Wang, Chen Zhang, Wenjun Peng +1

Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought Tool Use & Agents

Guray Ozgur +7 · Apr 21, 2026

ATTN-FIQA: Interpretable Attention-based Face Image Quality Assessment with Vision Transformers

Turns out, your pre-trained face recognition ViT already knows which faces are high quality, just by looking at the attention maps.

Guray Ozgur, Tahar Chettaoui, Eduarda Caldeira +5

Architecture Design (Transformers, SSMs, MoE) Computer Vision Interpretability & Mechanistic Interp

Apr 20, 2026

Shanghai Qizhi Institute · Apr 20, 2026 · also Nanjing Whale Cloud, SEU, Xiamen University

State Transfer Reveals Reuse in Controlled Routing

Fixed-interface transfer can achieve high routing accuracy without retraining, revealing deeper insights into model behavior than previously understood.

Yanzhen Lu, Zhicheng Qian, Muchen Jiang

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp

Apr 20, 2026 · also University of Zagreb

Reasoning Models Know What's Important, and Encode It in Their Activations

Model activations reveal a hidden layer of reasoning importance that surface-level analyses completely overlook.

Yaniv Nikankin, Martin Tutek, Tomer Ashuach +2

Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought

University of Cincinnati · Apr 20, 2026

Exploring Concreteness Through a Figurative Lens

LLMs can distinguish between literal and figurative meanings early in their processing, revealing a surprising geometric structure that simplifies figurative-language classification.

Saptarshi Ghosh, Tianyu Jiang

Interpretability & Mechanistic Interp Natural Language Processing

Prashant C. Raju · Apr 20, 2026

The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability

Predicting steerability with near-perfect accuracy while detecting drift more effectively than existing methods could transform how we monitor and control language models in real-world applications.

Prashant C. Raju

Interpretability & Mechanistic Interp Scalable Oversight & Alignment Theory

BITS Pilani · Apr 20, 2026

Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering

LLMs can self-correct reasoning errors mid-generation by simply watching their own residual stream for "phase shifts" and nudging the KV-cache, outperforming even prompted self-correction.

Manan Gupta, Dhruv Kumar

Inference & Quantization Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought

Sarwan Ali +1 · Apr 20, 2026

Multi-Scale Reversible Chaos Game Representation: A Unified Framework for Sequence Classification

MS-RCGR not only preserves complete sequence information but also enhances classification performance across diverse analytical paradigms, making it a game-changer for biological sequence analysis.

Sarwan Ali, Taslim Murad

Interpretability & Mechanistic Interp Scientific Discovery & Drug Design

Etienne Tajeuna +3 · Apr 20, 2026

CAARL: In-Context Learning for Interpretable Co-Evolving Time Series Forecasting

Unlock the black box of time series forecasting: CAARL uses LLMs to generate interpretable narratives that explain *why* predictions change.

Etienne Tajeuna, Patrick Asante Owusu, Armelle Brun +1

Interpretability & Mechanistic Interp

Pooyan Khosravinia +2 · Apr 20, 2026

Causally-Constrained Probabilistic Forecasting for Time-Series Anomaly Detection

Causal structural priors can significantly enhance both the robustness and interpretability of anomaly detection in complex multivariate time series.

Pooyan Khosravinia, João Gama, Bruno Veloso

Interpretability & Mechanistic Interp

Paris-Panthéon-Assas University Paris · Apr 20, 2026 · also IRIT - Toulouse University

A Sugeno Integral View of Binarized Neural Network Inference

Binarized neural networks can be understood through the lens of Sugeno integrals, revealing a structured way to interpret neuron decisions and input interactions.

Ismaïl Baaj, Henri Prade

Inference & Quantization Interpretability & Mechanistic Interp

Difan Jiao +6 · Apr 20, 2026

LLM Safety From Within: Detecting Harmful Content with Internal Representations

Harnessing the internal states of LLMs, SIREN outperforms traditional guard models while using a fraction of the parameters, revolutionizing harmful content detection.

Difan Jiao, Yilun Liu, Ye Yuan +4

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp

Mateusz Cedro +1 · Apr 20, 2026

On the Importance and Evaluation of Narrativity in Natural Language AI Explanations

Narrative-based explanations in XAI could dramatically improve human comprehension of model predictions, surpassing traditional static feature lists.

Mateusz Cedro, David Martens

Interpretability & Mechanistic Interp Natural Language Processing

University of Artificial Intelligence · Apr 20, 2026 · also Georgetown, Tohoku, UCSD, UTokyo

Dual Alignment Between Language Model Layers and Human Sentence Processing

Later layers of LLMs capture cognitive effort in syntactically challenging sentences better than earlier layers, but still miss the mark compared to human processing.

Tatsuki Kuribayashi, Alex Warstadt, Yohei Oseki +1

Interpretability & Mechanistic Interp Natural Language Processing

Tsinghua AI · Apr 20, 2026 · also Kyoto

Understanding the Prompt Sensitivity

LLMs disperse similar prompts instead of clustering them, leading to significant prompt sensitivity that challenges stability and reliability.

Yang Liu, Chenhui Chu

Interpretability & Mechanistic Interp Natural Language Processing

Apr 20, 2026

PRISMA: Preference-Reinforced Self-Training Approach for Interpretable Emotionally Intelligent Negotiation Dialogues

Achieve more human-like negotiation from dialogue agents by explicitly modeling and reasoning about emotions with interpretable chain-of-thought prompting.

Prajwal Vijay Kajare, Priyanshu Priya, Bikash Santra +1

Interpretability & Mechanistic Interp Natural Language Processing RLHF & Preference Learning

Apr 20, 2026 · also UW-Madison

From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models

Task-aware neuron steering in VLMs is now possible without gradients, unlocking better performance and interpretability across diverse multimodal tasks.

Qidong Wang, Ming Jiang

Interpretability & Mechanistic Interp Multimodal Models

Sanaz Sadat Hosseini +4Apr 20, 2026·also UNC

Community-Led AI Integration for Wildfire Risk Assessment: A Participatory AI Literacy and Explainability Integration (PALEI) Framework in Los Angeles, CA

Forget top-down AI deployment: this study shows how a community-led approach to AI-powered wildfire risk assessment can build trust and drive adoption by prioritizing local context and user experience.

Sanaz Sadat Hosseini, M. Azarbayjani, Mona Azarbayjani +2

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp

Aarhus UniversityApr 20, 2026·also CU Boulder, KIST

Decision-Aware Attention Propagation for Vision Transformer Explainability

DAP transforms how we interpret Vision Transformers by producing attribution maps that are not only more faithful but also significantly more class-sensitive than traditional methods.

Sehyeong Jo, Gangjae Jang, Haesol Park

Architecture Design (Transformers, SSMs, MoE) Computer Vision Interpretability & Mechanistic Interp

Isaac Llorente-SaguerApr 20, 2026

Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

Even after surgically removing refusal behavior from LLMs, a stable, linearly decodable representation of harmful intent persists in their residual streams.

Isaac Llorente-Saguer

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

Apr 20, 2026·also Integral, Kyoto, Openwork

Proposing Topic Models and Evaluation Frameworks for Analyzing Associations with External Outcomes: An Application to Leadership Analysis Using Large-Scale Corporate Review Data

LLMs can generate higher-quality, more consistent topics from text data, leading to better insights about external outcomes like employee morale.

Yura Yoshida, M. Kanai, M. Nakayama +5

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Natural Language Processing

Ahson Saiyed +2Apr 20, 2026

Towards Understanding the Robustness of Sparse Autoencoders

Integrating Sparse Autoencoders into transformer models can slash jailbreak success rates by up to 5x, reshaping our understanding of model robustness against adversarial attacks.

Ahson Saiyed, Sabrina Sadiekh, Chirag Agarwal

Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

MIT CSAILApr 20, 2026·also UW Allen School of CSE, UW Department of Philosophy

Navigating the Conceptual Multiverse

Uncover the hidden assumptions baked into LLM responses with a new interactive system that lets you explore alternative conceptual framings and values.

Andre Ye, Jenny Y. Huang, Alicia Guo +3

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp Tool Use & Agents

Ziyang LiuApr 20, 2026

Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

Forget parallel probing – a commit-open protocol using SAE feature traces can reliably expose hosted LLM providers silently substituting cheaper models, even against adaptive attacks.

Ziyang Liu

Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

S. Sheikhi +4Apr 20, 2026·also Oulu

ExAI5G: A Logic-Based Explainable AI Framework for Intrusion Detection in 5G Networks

You can achieve near-perfect intrusion detection in 5G networks *and* get human-interpretable rules, proving that transparency doesn't have to sacrifice performance.

S. Sheikhi, Saeid Sheikhi, Panos Kostakos +2

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp Natural Language Processing

Francesco Vitale +4Apr 20, 2026·also Napoli

Enhancing Anomaly-Based Intrusion Detection Systems with Process Mining

Process mining can turn black-box intrusion detection systems into transparent, prioritized alert generators without sacrificing accuracy.

Francesco Vitale, Francesco Grimaldi, Massimiliano Rak +2

Interpretability & Mechanistic Interp

Apr 20, 2026

From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge

LLM-generated debugging explanations are often vague or misleading, but this work shows you can make them dramatically better by carefully curating the context provided to the LLM.

Julius Porbeck, Christian Medeiros Adriano, C. Adriano +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp

Santosh Kesiraju +6Apr 20, 2026·also Brno University of Technology

FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

Multilingual and multimodal embeddings leak way more lexical information than you think – FLiP can recover 75% of the original text.

Santosh Kesiraju, Bolaji Yusuf, Šimon Sedláček +4

Interpretability & Mechanistic Interp Multimodal Models Natural Language Processing

Apr 20, 2026

Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs

LLMs have "hallucination neurons" for specific citation fields, and silencing them reduces fabrication.

Yuefei Chen, Yihao Quan +3

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Natural Language Processing

Apr 20, 2026·also ETH

Probing for Reading Times

Early layers of language models capture human-like processing signatures in reading, rivaling traditional measures like surprisal in predicting initial eye movements.

Tianyang Xu, Mario Giulianelli, Karolina Stanczak

Interpretability & Mechanistic Interp Natural Language Processing

Microsoft ResearchApr 20, 2026

Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

Token-level attribution struggles to pinpoint the causes of LLM failures in realistic settings, suggesting current interpretability tools may not be up to the task of debugging complex model behaviors.

Rongyuan Tan, Jue Zhang, Zhuozhao Li +4

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Natural Language Processing

Apr 19, 2026

Prabhudarshi Nayak +4Apr 19, 2026·also Cisco Systems Inc, Dell Inc, Institute of Management and Information, LTM Limited +1

Explainable Attention-Based LSTM Framework for Early Detection of AI-Assisted Ransomware via File System Behavioral Analysis

Attention-based LSTMs, coupled with XAI, can spot AI-assisted ransomware early by pinpointing subtle, yet critical, file system behavioral patterns.

Prabhudarshi Nayak, Gogulakrishnan Thiyagarajan, Debashree Priyadarshini +2

Interpretability & Mechanistic Interp

University of CincinnatiApr 19, 2026

Persona-Based Requirements Engineering for Explainable Multi-Agent Educational Systems: A Scenario Simulator for Clinical Reasoning Training

Over 78% of medical students reported improved clinical reasoning skills through a persona-driven approach to requirements engineering in explainable multi-agent educational systems (MAES).

Weibing Zheng, Laurah Turner, Jess Kropczynski +3

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp Tool Use & Agents

Yujia Zheng +4Apr 19, 2026

Diverse Dictionary Learning

Even when you can't fully identify latent variables, provably recovering their set-theoretic relationships unlocks structured understanding of the hidden world.

Yujia Zheng, Zijian Li, Shunxing Fan +2

Interpretability & Mechanistic Interp
