Interpretability & Mechanistic Interpretability
Understanding the internal mechanisms of neural networks through circuit analysis, feature visualization, and mechanistic interpretability.
Recent Papers
This paper introduces a model-hardware co-design framework for CNN-based SAR ATR that jointly optimizes adversarial robustness, model compression, and FPGA accelerator design. The framework uses hardware-guided structured pruning, informed by a hardware performance model, to explore robustness-efficiency trade-offs. Experiments on MSTAR and FUSAR-Ship datasets show the framework produces models up to 18.3x smaller with 3.1x fewer MACs while preserving robustness, and the FPGA implementation achieves significant latency and energy efficiency improvements compared to CPU/GPU baselines.
Develops a model-hardware co-design framework that unifies robustness-aware model compression and FPGA accelerator design for CNN-based SAR ATR, enabling exploration of robustness-efficiency trade-offs.
This paper introduces a calibrated Bayesian deep learning framework for medical imaging decision support, addressing the critical need for reliable uncertainty quantification in AI-assisted diagnostics. The framework combines a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) during training, which penalizes high-confidence errors and low-confidence correct predictions, with a post-hoc Dual Temperature Scaling (DTS) strategy to refine the posterior distribution. Validated on pneumonia screening, diabetic retinopathy detection, and skin lesion identification, the approach demonstrates improved calibration, robust performance in data-scarce scenarios, and effectiveness on imbalanced datasets.
Introduces a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) and Dual Temperature Scaling (DTS) strategy to improve calibration and uncertainty quantification in Bayesian deep learning models for medical imaging.
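The paper's exact formulation of CUB-Loss is not reproduced here, but a minimal PyTorch sketch of a confidence-uncertainty boundary penalty of this kind might look as follows; the margin value, the ReLU-hinge form, and the weighting are assumptions, not the authors' definition.

```python
import torch
import torch.nn.functional as F

def cub_style_loss(logits, targets, lambda_cub=0.5, margin=0.7):
    """Cross-entropy plus a hedged sketch of a confidence-uncertainty
    boundary penalty: confident errors and under-confident correct
    predictions are both penalized. Margin and weighting are assumptions."""
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)                 # predicted-class confidence
    correct = pred.eq(targets).float()
    # Penalize confidence above `margin` on wrong predictions ...
    overconfident_error = (1.0 - correct) * F.relu(conf - margin)
    # ... and confidence below `margin` on correct predictions.
    underconfident_hit = correct * F.relu(margin - conf)
    cub = (overconfident_error + underconfident_hit).mean()
    return ce + lambda_cub * cub

# Usage: loss = cub_style_loss(model(x), y); loss.backward()
```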
The paper introduces ModelWisdom, a toolkit designed to enhance the interpretability and usability of TLA+ model checking by addressing challenges in counterexample analysis and model repair. ModelWisdom integrates visualization techniques, graph optimization, LLM-based summarization, and automated repair suggestions to improve the debugging process. The toolkit's capabilities, including colorized violation highlighting, graph folding, and LLM-powered explanations, facilitate a more interactive and understandable workflow for TLA+ specifications.
Introduces an interactive environment, ModelWisdom, that leverages visualization and large language models to improve the interpretability and actionability of TLA+ model checking.
The paper introduces a neuromodulation-inspired parameter-efficient fine-tuning (PEFT) method that adapts large pretrained models by learning per-neuron thresholds and gains in activation space. This approach changes the mode of computation by selecting and rescaling existing computations rather than rewriting weights, offering improved interpretability. Experiments on MNIST and rotated MNIST show the method improves accuracy over a frozen baseline with significantly fewer trainable parameters than LoRA, while also enabling neuron-level attribution and conditional computation.
Introduces a parameter-efficient fine-tuning method that learns per-neuron thresholds and gains in activation space to adapt pretrained models by changing the mode of computation.
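As a rough illustration of the mechanism, a per-neuron gain-and-threshold adapter over a frozen layer's activations could be as small as the sketch below; the soft-threshold gate is an illustrative choice and not necessarily the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class NeuromodulationAdapter(nn.Module):
    """Wraps a frozen layer's activations with learnable per-neuron gains and
    thresholds; only these two vectors are trained. The ReLU soft-threshold
    form is an assumption."""
    def __init__(self, n_neurons):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(n_neurons))
        self.threshold = nn.Parameter(torch.zeros(n_neurons))

    def forward(self, act):
        # Pass (rescaled) activity only where it clears the learned threshold.
        return self.gain * torch.relu(act - self.threshold)

# Usage sketch: h = frozen_linear(x); h = adapter(h)
# where only `adapter` parameters require gradients.
```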
This paper introduces Hierarchical Sparse Autoencoders (HSAEs) to explicitly model the hierarchical relationships between features extracted from LLMs, addressing the limitation of standard SAEs that treat features in isolation. HSAEs incorporate a structural constraint loss and random feature perturbation to encourage alignment between parent and child features in the learned hierarchy. Experiments across various LLMs and layers demonstrate that HSAEs recover semantically meaningful hierarchies while preserving reconstruction fidelity and interpretability.
Introduces Hierarchical Sparse Autoencoders (HSAEs) to learn and represent the hierarchical relationships between features extracted from LLMs.
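A two-level sketch of the idea is shown below: parent and child dictionaries reconstruct the activation jointly, and a structural term nudges each child decoder direction toward its parent's. The cosine-alignment penalty and the fixed parent-child fan-out are assumptions standing in for the paper's structural constraint loss and perturbation scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalSAESketch(nn.Module):
    """Illustrative two-level sparse autoencoder: `n_parent` coarse features,
    each with `k_child` children sharing its decoder direction approximately."""
    def __init__(self, d_model, n_parent=512, k_child=4):
        super().__init__()
        self.k_child = k_child
        self.enc_p = nn.Linear(d_model, n_parent)
        self.enc_c = nn.Linear(d_model, n_parent * k_child)
        self.dec_p = nn.Linear(n_parent, d_model, bias=False)
        self.dec_c = nn.Linear(n_parent * k_child, d_model, bias=False)

    def forward(self, x, l1=1e-3, struct=1e-2):
        zp, zc = F.relu(self.enc_p(x)), F.relu(self.enc_c(x))
        recon = self.dec_p(zp) + self.dec_c(zc)
        loss = F.mse_loss(recon, x) + l1 * (zp.abs().mean() + zc.abs().mean())
        # Structural constraint: each child decoder row should point roughly
        # where its parent's decoder row points.
        Wp = self.dec_p.weight.T                      # (n_parent, d_model)
        Wc = self.dec_c.weight.T                      # (n_parent*k_child, d_model)
        parent_of_child = Wp.repeat_interleave(self.k_child, dim=0)
        align = 1 - F.cosine_similarity(Wc, parent_of_child, dim=-1).mean()
        return recon, loss + struct * align
```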
This paper extends crosscoder model diffing to cross-architecture comparisons, enabling the unsupervised discovery of behavioral differences between LLMs with different architectures. They introduce Dedicated Feature Crosscoders (DFCs), an architectural modification to improve the isolation of unique features in one model compared to another. Applying this technique, they identify features such as CCP alignment in Qwen3-8B and Deepseek-R1-0528-Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and a copyright refusal mechanism in GPT-OSS-20B.
Introduces Dedicated Feature Crosscoders (DFCs), an architectural modification to enhance crosscoder model diffing for isolating features unique to individual models in cross-architecture comparisons.
The paper introduces ProtoMech, a framework for mechanistic interpretability of protein language models (pLMs) that uses cross-layer transcoders to learn sparse latent representations capturing the model's full computational circuitry. By jointly analyzing representations across layers of ESM2, ProtoMech identifies compressed circuits that retain significant performance on protein family classification and function prediction while using only a small fraction of the latent space. Steering along these identified circuits enables high-fitness protein design, demonstrating the framework's utility in understanding and manipulating pLM behavior.
Introduces ProtoMech, a novel framework that discovers computational circuits in protein language models by learning sparse, cross-layer latent representations.
The paper introduces the Prototype Transformer (ProtoT), an autoregressive language model architecture that uses prototypes (parameter vectors) instead of self-attention to improve interpretability. ProtoT establishes two-way communication between the input sequence and the prototypes, causing the prototypes to capture nameable concepts during training and creating interpretable communication channels. Experiments demonstrate that ProtoT scales linearly with sequence length, performs well on text generation and downstream tasks (GLUE), and exhibits robustness to input perturbations while providing interpretable pathways for understanding robustness and sensitivity.
Introduces the Prototype Transformer, a novel autoregressive language model architecture designed for interpretability by using prototypes to capture nameable concepts and create interpretable communication channels.
The paper investigates how reasoning behaviors in LLMs influence reasoning quality by analyzing behavioral patterns in model responses. They find that injecting specific reasoning behavior patterns can significantly improve reasoning outcomes. Based on this, they propose two parameter-free optimization methods, InjectCorrect (imitating patterns from past correct answers) and InjectRLOpt (using a learned value function to generate behavior injectants), to steer the reasoning process.
Introduces InjectRBP, a novel framework for steering LLM reasoning by structurally injecting observed behavioral patterns, without requiring parameter updates.
The paper introduces a rule-based computational model for Gàidhlig (Scottish Gaelic) morphology, addressing the challenge of limited data availability for low-resource languages that hinders the application of neural models. The model leverages data from Wiktionary and uses SQL queries to identify lexical patterns, constructing a declarative rule-base for generating inflected word forms via Python utilities. This approach demonstrates that rule-based systems can effectively utilize limited data while providing interpretability and supporting the development of educational tools.
Presents a functional rule-based system for Gàidhlig morphology using Wiktionary data and SQL queries to generate inflected word forms.
The paper introduces V-SHiNE, a browser-based virtual smart home environment designed to facilitate the evaluation of explainable AI (XAI) methods in the context of smart home automation. V-SHiNE enables researchers to configure realistic smart home environments, simulate user behaviors, integrate custom explanation engines, and log user interactions. A user study with 159 participants demonstrates the framework's utility for assessing the impact and quality of different explanation strategies.
Introduces V-SHiNE, a novel browser-based simulation framework, to enable scalable and reproducible evaluation of XAI methods within virtual smart home environments.
The paper investigates whether neural world models truly learn physical laws or rely on statistical shortcuts, particularly under out-of-distribution shifts. They introduce PhyIP, a non-invasive evaluation protocol that assesses the linear decodability of physical quantities from frozen latent representations, contrasting it with adaptation-based methods. Their results show that when self-supervised learning achieves low error, latent physical structures are linearly accessible and robust to OOD shifts, while adaptation-based evaluations can collapse this structure, suggesting that non-invasive probes are more accurate for evaluating physical world models.
Introduces PhyIP, a non-invasive evaluation protocol, to accurately assess the linear accessibility of physical quantities in frozen latent representations of world models, demonstrating its superiority over adaptation-based methods.
The paper introduces Distribution Map (DMAP), a novel method for representing text using next-token probability distributions from LLMs by mapping text to samples in the unit interval that encode rank and probability. DMAP addresses the limitations of perplexity by accounting for context and the shape of the conditional distribution. The authors demonstrate DMAP's utility in validating generation parameters, detecting machine-generated text via probability curvature, and performing forensic analysis of models fine-tuned on synthetic data.
Introduces DMAP, a mathematically grounded method for representing text as a distribution of samples in the unit interval based on next-token probability distributions from LLMs, enabling efficient and model-agnostic text analysis.
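One plausible way to realize such a mapping, given precomputed next-token logits, is sketched below: each observed token is placed in the unit interval according to the probability mass of strictly more likely tokens plus half its own mass, which encodes both rank and probability. This specific encoding is an assumption, not DMAP's published formula.

```python
import torch

def dmap_style_scores(logits, token_ids):
    """Map each observed token to a point in (0, 1) reflecting its rank and
    probability under the model's conditional next-token distribution.

    logits:    (seq_len, vocab) next-token logits
    token_ids: (seq_len,) the tokens actually observed at each position
    """
    probs = torch.softmax(logits, dim=-1)
    p_obs = probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    mass_above = (probs * (probs > p_obs.unsqueeze(-1))).sum(-1)
    return 1.0 - (mass_above + 0.5 * p_obs)   # values near 1 = top-ranked tokens
```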
This paper investigates the problem of unstable feature importance estimates in expressive machine learning models, which hinders their use in scientific discovery. The authors theoretically analyze the bias-variance tradeoff in aggregating feature importance estimates, demonstrating that ensembling at the model level yields more accurate estimates by reducing excess risk. They empirically validate their theoretical findings on benchmark datasets and a large-scale proteomic study from the UK Biobank.
Demonstrates theoretically and empirically that ensembling at the model level, rather than aggregating individual model explanations, provides more accurate feature importance estimates, especially for expressive models.
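The contrast the paper draws can be reproduced with scikit-learn: compute permutation importance on the ensemble's averaged prediction rather than averaging per-model importance scores. The toy data, model choice, and scorer below are placeholders for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

models = [GradientBoostingRegressor(random_state=s).fit(X, y) for s in range(5)]

class MeanEnsemble:
    """Ensemble-level model: importance is computed on the averaged prediction."""
    def __init__(self, models): self.models = models
    def fit(self, X, y): return self          # members are already fitted
    def predict(self, X):
        return np.mean([m.predict(X) for m in self.models], axis=0)

# (a) aggregate explanations of individual models (higher variance per the paper)
per_model_imp = np.mean(
    [permutation_importance(m, X, y, scoring="r2", n_repeats=10,
                            random_state=0).importances_mean for m in models],
    axis=0)

# (b) explain the ensemble itself (the model-level aggregation the paper favors)
ensemble_imp = permutation_importance(MeanEnsemble(models), X, y, scoring="r2",
                                      n_repeats=10, random_state=0).importances_mean
```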
This paper investigates the internal representations of high-level musical concepts within audio diffusion models using activation patching, revealing that a small subset of attention layers controls distinct semantic concepts. They then use Contrastive Activation Addition and Sparse Autoencoders in these key layers to achieve more precise control over audio generation. The authors demonstrate the ability to manipulate specific musical elements like tempo and mood by steering activations in the identified layers.
Demonstrates precise control over generated audio by identifying and steering activations in specific attention layers of audio diffusion models.
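A generic activation-addition hook of the kind used for this sort of steering is sketched below; the layer index, the steering vector (typically a difference of mean activations over two contrasting prompt sets), and the attribute path into the diffusion model are placeholders, not the paper's implementation.

```python
import torch

def make_steering_hook(steer_vec, alpha=4.0):
    """Forward hook that adds a steering direction to a layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steer_vec.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage (model.blocks[k] is a placeholder attribute path):
# handle = model.blocks[k].register_forward_hook(make_steering_hook(steer_vec))
# ... run generation ...
# handle.remove()
```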
This paper introduces SpaTeoGL, a spatiotemporal graph learning framework that constructs window-level spatial graphs of iEEG electrode interactions and a temporal graph linking time windows based on spatial graph similarity. The method uses a smooth graph signal processing formulation solved via alternating block coordinate descent, providing convergence guarantees. Experiments on a multicenter iEEG dataset demonstrate that SpaTeoGL achieves competitive SOZ localization performance compared to horizontal visibility graphs and logistic regression, while also enhancing non-SOZ identification and offering interpretable insights into seizure dynamics.
Introduces a novel spatiotemporal graph learning framework, SpaTeoGL, to model and interpret seizure onset zone dynamics from iEEG data.
This paper investigates jailbreaking attacks on LLMs by analyzing differences in internal representations between jailbreak and benign prompts across multiple open-source models (GPT-J, LLaMA, Mistral, Mamba). They propose a tensor-based latent representation framework to capture structure in hidden activations, enabling jailbreak detection without fine-tuning or auxiliary LLMs. By selectively bypassing high-susceptibility layers in LLaMA-3.1-8B, the method blocks 78% of jailbreak attempts while preserving 94% of benign behavior, demonstrating the potential for inference-time interventions.
Introduces a tensor-based latent representation framework for detecting and disrupting jailbreak attacks by analyzing and manipulating internal activations of LLMs at inference time.
The paper introduces SafeNeuron, a neuron-level safety alignment framework for LLMs designed to improve robustness against neuron-level attacks. It identifies and freezes safety-related neurons during preference optimization, forcing the model to develop redundant safety representations across the network. Experiments show SafeNeuron enhances robustness against neuron pruning attacks, mitigates the risk of models being used for red-teaming, and maintains general capabilities, while also revealing stable and shared internal safety representations.
Introduces SafeNeuron, a novel neuron-level safety alignment framework that enhances LLM robustness by redistributing safety representations across the network.
This paper introduces an attribution-guided query rewriting method to improve the robustness of neural retrievers when faced with underspecified or ambiguous queries. The approach computes gradient-based token attributions from the retriever to identify problematic query components and then uses these attributions to guide an LLM in rewriting the query. Experiments on BEIR collections demonstrate that this method consistently improves retrieval effectiveness compared to existing query rewriting and explainability-based techniques, especially for implicit or ambiguous information needs.
Introduces an attribution-guided query rewriting framework that leverages retriever feedback to improve query clarity and retrieval effectiveness.
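A minimal gradient-times-embedding attribution for a bi-encoder retriever, of the kind that could feed such a rewriting prompt, is sketched below; `query_emb_fn` and the pooled-embedding interface are hypothetical placeholders, not the paper's retriever API.

```python
import torch

def token_attributions(query_emb_fn, query_token_embeds, doc_vec):
    """Gradient x input attribution of the query-document dot product with
    respect to each query token embedding.

    query_emb_fn:       hypothetical callable mapping (1, seq, d) token
                        embeddings to a pooled query vector of shape (d,)
    query_token_embeds: (1, seq, d) input token embeddings
    doc_vec:            (d,) precomputed document embedding
    """
    embeds = query_token_embeds.clone().requires_grad_(True)
    score = query_emb_fn(embeds) @ doc_vec          # retrieval score
    score.backward()
    # Per-token relevance: projection of the gradient onto the embedding.
    return (embeds.grad * embeds).sum(dim=-1).squeeze(0)   # (seq_len,)

# Tokens with low or negative attribution would then be flagged in the prompt
# given to the rewriting LLM.
```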
The paper introduces Neural Additive Experts (NAEs), a mixture-of-experts framework that learns specialized networks per feature and uses a dynamic gating mechanism to integrate information across features, relaxing the strict additivity of standard GAMs. By employing targeted regularization techniques to reduce variance among expert predictions, NAEs enable a smooth transition from additive models to those capturing feature interactions. Experiments on synthetic and real-world datasets demonstrate that NAEs achieve a better balance between predictive accuracy and feature-level interpretability compared to standard GAMs.
Introduces Neural Additive Experts (NAEs), a novel architecture that balances predictive accuracy and feature-level interpretability by using a mixture-of-experts framework with dynamic gating and targeted regularization to control the degree of model additivity.
This paper investigates the applicability of attribution-based explainability methods, commonly used for static classification tasks, to agentic AI systems where behavior emerges over multi-step trajectories. The authors compare attribution-based explanations with trace-based diagnostics in both static classification and agentic benchmarks (TAU-bench Airline and AssistantBench). They find that attribution methods, while stable in static settings, are unreliable for diagnosing execution-level failures in agentic trajectories, whereas trace-grounded rubric evaluation effectively localizes behavior breakdowns.
Demonstrates the limitations of applying attribution-based explainability methods designed for static predictions to agentic AI systems and advocates for trajectory-level explainability.
This paper identifies three key dimensions of safety for foundation model (FM)-enabled robots: action, decision, and human-centered safety, arguing that existing methods are insufficient for open-ended real-world scenarios. To address this, they propose a modular safety guardrail architecture with monitoring and intervention layers to ensure comprehensive safety across the autonomy stack. The paper further suggests cross-layer co-design strategies, such as representation alignment and conservatism allocation, to improve the speed and effectiveness of safety enforcement.
Proposes a modular safety guardrail architecture, composed of monitoring and intervention layers, to address the multifaceted safety challenges of deploying foundation model-enabled robots in real-world environments.
The paper investigates the role of individual layers in Vision-Language Models (VLMs) and discovers the existence of Task-Interfering Layers (TILs) that hinder downstream task performance. They quantify the effect of intervening on each layer using a Task-Layer Interaction Vector and observe task-specific sensitivity patterns. Based on these findings, they propose TaLo, a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer, achieving significant performance improvements on various tasks and models.
Discovers and characterizes Task-Interfering Layers in VLMs, demonstrating that bypassing these layers at inference time can improve performance without retraining.
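Bypassing a single block at inference can be done by swapping it for an identity wrapper, as in the sketch below; the attribute path into the model and the tuple return convention are assumptions about a Hugging Face-style layer stack, not TaLo's code.

```python
import torch.nn as nn

class SkipBlock(nn.Module):
    """Identity wrapper that bypasses a transformer block while keeping the
    original module around so it can be restored."""
    def __init__(self, block):
        super().__init__()
        self.block = block
    def forward(self, hidden_states, *args, **kwargs):
        return (hidden_states,)     # assumes the block normally returns a tuple

# Hypothetical usage; `model.encoder.layers` and the chosen index are placeholders:
# idx = most_interfering_layer      # picked per task from the interaction vector
# model.encoder.layers[idx] = SkipBlock(model.encoder.layers[idx])
```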
This paper introduces a physics-inspired deep reinforcement learning (DRL) framework for robotic motion planning that leverages Coulomb forces to model interactions between the robot, goal, and obstacles. The approach incorporates these forces into the reward function, providing attractive and repulsive signals, and further enhances collision avoidance using anticipatory rewards derived from LiDAR segmentation of obstacle boundaries. Experiments in both Gazebo simulations and real-world TurtleBot v3 deployments demonstrate that the proposed method reduces collisions and generates safer trajectories.
Introduces a novel physics-inspired reward function for DRL-based robotic motion planning using Coulomb forces and LiDAR-based anticipatory rewards to improve safety and explainability.
This paper investigates the impact of different explanation styles in AI-driven security dashboards on user trust, decision accuracy, and cognitive load. The authors conducted a mixed-methods study with security practitioners, comparing natural language rationales, confidence visualizations, counterfactual explanations, and hybrid approaches. Results demonstrate that explanation style significantly affects user trust calibration, decision accuracy, and cognitive load, leading to design guidelines for integrating explainability into enterprise UIs.
Empirically demonstrates the impact of various explanation styles on security analysts' trust, decision-making, and cognitive load within AI-enhanced UI security interfaces.
This paper introduces a jailbreak framework for vision-language models (VLMs) that combines Chain-of-Thought (CoT) prompting with a ReAct-driven adaptive noising mechanism to bypass safety filters. The adaptive noising iteratively perturbs input images based on model feedback, focusing on regions that trigger safety defenses. Experiments show that this dual-strategy significantly improves attack success rates (ASR) while preserving the naturalness of both text and visual inputs.
Introduces a novel jailbreak framework combining CoT prompting with ReAct-driven adaptive image noising to effectively bypass VLM safety filters.
This paper addresses the problem of spurious correlations in reward models used in Reinforcement Learning from Human Feedback (RLHF) by proposing a factored representation learning framework. The framework decomposes contextual embeddings into causal factors sufficient for reward prediction and non-causal factors capturing reward-irrelevant attributes, constraining the reward head to depend only on the causal component. Experiments on mathematical and dialogue tasks demonstrate improved robustness and downstream RLHF performance compared to baselines, with analyses showing mitigation of reward hacking behaviors like exploiting length and sycophantic bias.
Introduces a factored representation learning framework that decomposes contextual embeddings into causal and non-causal factors to improve the robustness of reward models in RLHF.
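An illustrative version of the decomposition is sketched below: the pooled embedding is split into a causal and a non-causal factor, the reward head reads only the causal part, and a cross-covariance penalty discourages leakage between the two. The penalty and dimensions are assumptions; the paper's exact constraints may differ.

```python
import torch
import torch.nn as nn

class FactoredRewardHead(nn.Module):
    """Scores rewards from a 'causal' factor only, with a decorrelation term
    between the causal and non-causal factors (illustrative, not the paper's
    exact objective)."""
    def __init__(self, d_model, d_factor=128):
        super().__init__()
        self.to_causal = nn.Linear(d_model, d_factor)
        self.to_spurious = nn.Linear(d_model, d_factor)
        self.reward = nn.Linear(d_factor, 1)

    def forward(self, h):                           # h: (batch, d_model)
        zc, zs = self.to_causal(h), self.to_spurious(h)
        r = self.reward(zc).squeeze(-1)
        # Penalize cross-covariance between the two factors (batch statistics).
        zc_c = zc - zc.mean(0, keepdim=True)
        zs_c = zs - zs.mean(0, keepdim=True)
        decorr = (zc_c.T @ zs_c / h.size(0)).pow(2).mean()
        return r, decorr

# Training would combine the usual Bradley-Terry preference loss on `r`
# with a small weight on `decorr`.
```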
The authors created REVEAL-CXR, a benchmark dataset of 200 chest radiographs with 12 labels for cardiothoracic disease, to evaluate multimodal large language models (LLMs) in radiology. They used GPT-4o and Phi-4-Reasoning to extract and map findings from 13,735 chest radiograph reports, then sampled 1,000 studies for expert radiologist review. The final dataset of 200 radiographs, verified by three radiologists, is publicly available for benchmarking and includes a holdout set for independent model evaluation by RSNA.
Introduces REVEAL-CXR, a high-quality, expert-validated benchmark dataset for chest radiograph interpretation, designed to facilitate the development and evaluation of clinically useful multimodal LLMs in radiology.
The paper introduces Heart2Mind, a Contestable AI (CAI) system for psychiatric disorder prediction using wearable ECG data, designed to allow clinicians to inspect and revise algorithmic outputs. The system employs a Multi-Scale Temporal-Frequency Transformer (MSTFT) to analyze R-R intervals from ECG sensors, combining time and frequency domain features. Results on the HRV-ACC dataset show MSTFT achieves 91.7% accuracy, and human-centered evaluation demonstrates that experts and the CAI system can effectively collaborate to confirm correct decisions and correct errors through dialogue.
Introduces a contestable AI system, Heart2Mind, that integrates a multi-scale temporal-frequency transformer with self-adversarial explanations and a collaborative chatbot to enable clinicians to scrutinize and refine psychiatric disorder predictions based on wearable ECG data.
This paper introduces Multimodal Generative Engine Optimization (MGEO), a novel adversarial attack framework that exploits vulnerabilities in VLM-based product search ranking systems. MGEO jointly optimizes imperceptible image perturbations and fluent textual suffixes to unfairly promote a target product, leveraging the cross-modal coupling within VLMs. Experiments on real-world datasets demonstrate that MGEO significantly outperforms unimodal attacks, highlighting the vulnerability of VLMs to coordinated multimodal manipulation.
Reveals a critical vulnerability in VLM-based ranking systems by demonstrating a coordinated multimodal attack that significantly outperforms unimodal attacks.
This paper addresses the challenge of integrating Explainable AI (XAI) into chest radiology by developing two deep learning-based XAI systems for pneumonia and COVID-19 detection using Grad-CAM and LIME. They introduce a multi-phase Human-Centered Design (HCD) methodology involving radiologists and clinicians in co-design and iterative prototyping to create a usable XAI interface. The study found that radiologists preferred combined original and AI-annotated images with adjustable overlays and tailored explanatory text, and that confidence scores aligned with clinical reasoning enhance trust and adoption.
Introduces a multi-phase Human-Centered Design (HCD) methodology for XAI in chest radiology, emphasizing participatory co-design and iterative prototyping with radiologists and clinicians.
This paper investigates visual-text fusion in MLLMs through layer-wise masking and attention analysis, revealing non-uniform fusion across layers and a late-stage visual signal reactivation. The authors identify persistent high-attention noise on irrelevant regions and increasing attention on text-aligned areas during processing. Based on these insights, they propose a training-free contrastive attention framework that models attention shifts between early fusion and final layers, enhancing multimodal reasoning.
Introduces a training-free contrastive attention framework that models attention shifts between early fusion and final layers to improve multimodal reasoning in MLLMs.
The paper introduces SCALPEL, a framework for selectively ablating capabilities in LLMs by representing them as low-rank parameter subspaces and using LoRA adapters to reduce the model's ability to distinguish correct from incorrect answers on specific tasks. This approach allows for fine-grained capability removal without affecting other capabilities, addressing the limitations of coarse-grained methods that assume direct mapping between capabilities and modules. Experiments on diverse tasks demonstrate SCALPEL's effectiveness in removing target capabilities while preserving general language modeling abilities, revealing the low-rank structure of capabilities and enabling targeted parameter-space interventions.
Introduces SCALPEL, a novel method for selectively ablating capabilities in LLMs by identifying and modifying low-rank parameter subspaces associated with those capabilities.
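A rough sketch of what a capability-removal objective of this kind could look like is given below: drive the model's margin between correct and incorrect answers on the target task toward zero while a KL term to the frozen base model preserves general behavior, training only low-rank adapter parameters. The squared-margin form and the loss weighting are assumptions, not SCALPEL's published objective.

```python
import torch

def ablation_objective(logp_correct, logp_incorrect, kl_to_base, beta=0.1):
    """Sketch of a selective-ablation loss over a batch of task items.

    logp_correct / logp_incorrect: per-example log-likelihoods of the correct
        and incorrect answers under the adapted model
    kl_to_base: per-example KL penalty to the frozen base model on held-out text
    """
    margin = logp_correct - logp_incorrect          # per-example discrimination
    forget = margin.pow(2).mean()                   # push discrimination to zero
    return forget + beta * kl_to_base.mean()
```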
This study benchmarked ChatGPT-5, Claude AI (Sonnet 4.0), and Perplexity (Mistral Large 2) on their ability to answer dental trauma questions, assessing accuracy, consistency, readability, and information quality. Perplexity demonstrated the highest accuracy on true/false questions, while ChatGPT excelled in readability, Perplexity in understandability and actionability, and Claude in information reliability for open-ended questions. The results suggest that LLM-based chatbots can play a complementary role in dental trauma management, with tool selection dependent on the specific application.
Quantifies the performance of three prominent LLM-based chatbots across multiple dimensions relevant to dental trauma management, highlighting their strengths and weaknesses.
This paper replicates Anthropic's mechanistic interpretability work using sparse autoencoders (SAEs) on Llama 3.1 to extract and steer human-interpretable features, stress-testing the generalizability of these methods. The authors successfully reproduce basic feature extraction and steering, but find significant fragility in feature steering, sensitivity to various parameters, and difficulty in distinguishing thematically similar features. The study concludes that current SAE-based interpretability methods lack the systematic reliability needed for safety-critical applications, suggesting a shift towards prioritizing reliable model output prediction and control.
Demonstrates the fragility and limitations of current SAE-based mechanistic interpretability techniques for Llama 3.1, particularly regarding feature steering and thematic feature differentiation.
This paper introduces Hidden State Poisoning Attacks (HiSPAs) that exploit vulnerabilities in Mamba-based language models by overwriting information in their hidden states, leading to a partial amnesia effect. The authors evaluate the impact of HiSPAs using the RoBench25 benchmark, demonstrating the susceptibility of SSMs, including a 52B Jamba model, to these attacks, unlike pure Transformers. Furthermore, they show that HiSPA triggers weaken the Jamba model on the Open-Prompt-Injections benchmark and provide an interpretability analysis of Mamba's hidden layers during attacks.
Demonstrates the vulnerability of Mamba-based language models to Hidden State Poisoning Attacks (HiSPAs), which induce partial amnesia by overwriting information in hidden states.
This paper evaluates the clinical performance of five large language models (LLMs) in complex cardiac surgery scenarios using a blinded two-phase evaluation by senior surgeons. The study found that while a reasoning-optimized proprietary LLM (O1) performed best, all models exhibited deficits in patient safety, hallucination avoidance, and clinical efficiency. A key finding was the "overacceptance" failure mode, where clinicians initially failed to identify flawed model outputs, suggesting that over-reliance on LLMs could pose significant risks in clinical decision-making.
Reveals a critical human-AI collaboration failure mode of "overacceptance" in cardiac surgery, where clinicians initially miss flawed LLM outputs, highlighting potential risks beyond simple model inaccuracy.
This study evaluated the performance of four LLMs (ChatGPT-4o, DeepSeek-V2, Gemini, and Grok) in applying the 2019 European Society of Cardiology guidelines for pulmonary embolism (PE) using ten open-ended questions based on a simulated PE case. The LLMs were scored by emergency physicians based on clinical accuracy and adherence to guidelines, revealing that ChatGPT-4o achieved the highest overall score, but performance varied across different clinical domains. While the LLMs show promise, the study highlights the need for further development to improve clinical integration and guideline compliance.
Quantifies the performance of four prominent LLMs in the context of applying evidence-based guidelines for pulmonary embolism, revealing both strengths and weaknesses in their clinical reasoning and guideline adherence.
The paper introduces Patch-based Adversarial Noise Compression (PANC), a decision-based black-box adversarial attack method designed to efficiently attack Transformer-based visual trackers by exploiting patch-wise noise sensitivity. PANC uses a noise sensitivity matrix to dynamically adjust adversarial noise levels in different patches, optimizing noise distribution and reducing query counts. Experiments on OSTrack, STARK, TransT, and MixformerV2 across GOT-10k, TrackingNet, and LaSOT datasets demonstrate that PANC achieves a 162% improvement in attack effectiveness with only 45.7% of the queries compared to existing methods, while compressing noise to 10% of the original level.
Introduces a patch-based adversarial noise compression (PANC) method that significantly improves the efficiency and concealment of decision-based black-box adversarial attacks against Transformer-based visual trackers.
The paper introduces XIMED, a dual-loop evaluation framework for XAI methods in medical imaging, specifically chest X-ray classification. It evaluates LIME and SHAP explanations using both predictive model-centered metrics (sensitivity to model changes, feature identification) and human-centered metrics (trust, diagnostic agreement) with 97 medical experts. The study found that while SHAP significantly impacted diagnostic changes and both methods identified critical features, both LIME and SHAP negatively impacted contra-indicative agreement, with SHAP proving more effective in facilitating correct diagnostic changes when initial diagnoses were correct.
Introduces XIMED, a comprehensive dual-loop evaluation framework integrating predictive model-centered and human-centered evaluations for assessing XAI methods in medical imaging.
This paper introduces a novel causal attribution framework for supply chain simulations that combines Shapley values with Gaussian process emulators to decompose simulation outputs into individual input effects. The approach addresses the challenge of explaining complex simulation outputs by quantifying the contribution of each input feature. Experiments on synthetic and real-world supply chain data demonstrate the framework's ability to efficiently identify root causes of anomalies.
Introduces a Shapley value-based causal attribution framework integrated with Gaussian process models to explain and decompose complex supply chain simulation outputs.
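The basic recipe can be sketched with scikit-learn: fit a Gaussian process emulator to logged simulator runs, then estimate Shapley values on the cheap emulator by Monte Carlo permutation sampling. The kernel choice, the `X_sim`/`y_sim` placeholders, and the sampling scheme are assumptions for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def mc_shapley(f, x, background, n_perm=200, rng=None):
    """Monte Carlo permutation estimate of Shapley values of f at point x,
    using rows of `background` to stand in for 'absent' features."""
    rng = rng or np.random.default_rng(0)
    d = x.shape[0]
    phi = np.zeros(d)
    for _ in range(n_perm):
        order = rng.permutation(d)
        z = background[rng.integers(len(background))].copy()
        prev = f(z[None])[0]
        for j in order:
            z[j] = x[j]                     # reveal feature j
            cur = f(z[None])[0]
            phi[j] += cur - prev
            prev = cur
    return phi / n_perm

# Emulate the expensive simulator with a GP, then attribute its output.
# `X_sim`, `y_sim` are logged simulator inputs/outputs (placeholders here).
# gp = GaussianProcessRegressor(ConstantKernel() * RBF()).fit(X_sim, y_sim)
# phi = mc_shapley(gp.predict, X_sim[0], X_sim)
```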
This paper introduces a multimodal RGB-D gait recognition framework for Parkinson's disease (PD) detection that enhances accuracy and interpretability. The framework uses dual YOLOv11 encoders for feature extraction from RGB and Depth data, followed by a Multi-Scale Local-Global Extraction (MLGE) module and Cross-Spatial Neck Fusion to improve spatial-temporal representation. A frozen Large Language Model (LLM) then translates fused visual embeddings and metadata into clinically relevant textual explanations, improving transparency and bridging the gap between visual recognition and clinical understanding.
Introduces a novel vision-language framework for Parkinson's disease gait analysis that fuses multimodal RGB-D data with a frozen LLM to generate clinically meaningful textual explanations, enhancing both accuracy and interpretability.
This paper adapts mechanistic interpretability techniques, including activation patching, attention saliency, and sparse autoencoders, to transformer-based time series classification models. The study aims to uncover the internal decision-making processes of these models, which are often obscured by their complexity. The experiments on benchmark datasets reveal causal graphs illustrating information flow, key attention heads, and temporal positions that drive correct classifications, along with interpretable latent features discovered via sparse autoencoders.
Demonstrates the applicability of mechanistic interpretability techniques, originally developed for NLP, to transformer-based time series classification, revealing internal causal structures and interpretable latent features.
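Activation patching itself reduces to a short hook-based routine, sketched below for a generic classifier: cache a layer's output on a clean input, replay it during the corrupted forward pass, and measure how much of the target logit is recovered. The assumption that the layer outputs a single tensor and that `model(x)` returns logits is a placeholder, not tied to the paper's models.

```python
import torch

def activation_patch_effect(model, layer, clean_x, corrupt_x, target_class):
    """Fraction of the clean-vs-corrupted logit gap restored by patching
    `layer`'s clean activation into the corrupted run (1.0 = full recovery)."""
    with torch.no_grad():
        corrupt_logit = model(corrupt_x)[..., target_class]

    cache = {}
    h = layer.register_forward_hook(lambda m, i, o: cache.update(act=o.detach()))
    with torch.no_grad():
        clean_logit = model(clean_x)[..., target_class]
    h.remove()

    # Returning a tensor from a forward hook replaces the module's output.
    h = layer.register_forward_hook(lambda m, i, o: cache["act"])
    with torch.no_grad():
        patched_logit = model(corrupt_x)[..., target_class]
    h.remove()

    return (patched_logit - corrupt_logit) / (clean_logit - corrupt_logit + 1e-9)
```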
The paper introduces a collaborative multimodal intelligence framework for plant disease diagnosis, integrating vision-language modeling, environmental sensing, and structured knowledge reasoning. This multi-agent system uses specialized agents for feature extraction, context-aware reasoning, and decision fusion to improve diagnostic accuracy in complex environments. The system achieves 54.5% overall diagnostic accuracy, outperforming single-modality baselines by 18.5%, and is deployed as a knowledge-distilled lightweight model within a WeChat Mini Program for real-time diagnosis.
Introduces a multi-agent architecture that integrates vision-language models, environmental sensors, and structured knowledge to improve plant disease diagnosis interpretability and accuracy.
This paper introduces a framework for explainable GenAI that combines cognitive-inspired interpretability, retrieval-augmented generation, and semantic attribution mapping to decode and visualize LLM reasoning. The framework translates latent reasoning traces into human-understandable graphical narratives across token, layer, and decision levels. Experiments on reasoning datasets demonstrate improvements in faithfulness, causal coherence, and user interpretability, advancing ethically aligned and auditable AI systems.
Introduces a novel visualization engine that translates latent LLM reasoning traces into human-understandable graphical narratives, enhancing transparency across different levels of abstraction.
The paper introduces ECG-XPLAIM, a deep learning model for ECG classification that uses a 1D inception-style CNN to capture local waveform features and global rhythm patterns, enhanced with Grad-CAM for interpretability. Trained on MIMIC-IV and validated on PTB-XL, ECG-XPLAIM achieved high diagnostic performance (AUROC > 0.9) for multiple arrhythmias, demonstrating superior performance compared to baseline models and improved sensitivity over a ResNet model. The model's interpretability, achieved through Grad-CAM highlighting relevant ECG segments, addresses a key limitation of AI in clinical ECG analysis.
Introduces a novel, interpretable deep learning architecture, ECG-XPLAIM, for arrhythmia detection that combines high accuracy with explainability through Grad-CAM visualization of relevant ECG segments.
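Grad-CAM over a 1D feature map follows the standard recipe, sketched below; the `conv_layer` handle and the single-channel input shape are placeholder assumptions rather than ECG-XPLAIM's code.

```python
import torch
import torch.nn.functional as F

def grad_cam_1d(model, conv_layer, signal, class_idx):
    """Grad-CAM for a 1D CNN: weight each channel of the last conv feature map
    by the gradient of the target-class score, combine, ReLU, and upsample to
    the input length. `signal` is assumed to be (1, channels, length)."""
    feats, grads = {}, {}
    h1 = conv_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    logits = model(signal)                    # (1, n_classes)
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()

    a, g = feats["a"], grads["g"]             # both (1, C, T')
    weights = g.mean(dim=-1, keepdim=True)    # per-channel importance
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))        # (1, 1, T')
    cam = F.interpolate(cam, size=signal.shape[-1], mode="linear",
                        align_corners=False)
    return (cam / (cam.max() + 1e-9)).squeeze()                 # (T,) in [0, 1]
```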
This paper investigates the evolution of vocabulary embedding geometry in LLMs during training by correlating input and output embeddings of Pythia 12B and OLMo 7B with semantic, syntactic, and frequency-based metrics using representational similarity analysis. The study reveals that vocabulary embedding geometry rapidly aligns with semantic and syntactic features early in training. Furthermore, high-frequency and function words converge faster than low-frequency words, which retain initial bias.
Demonstrates that linguistic structure emerges rapidly in vocabulary embeddings during LLM training, with distinct convergence rates based on word frequency and function.
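The core RSA computation is compact, as sketched below: compare the pairwise-dissimilarity structure of the embedding matrix with that of a per-word metric via rank correlation. The cosine/Euclidean distance choices are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(embeddings, feature_values):
    """Representational similarity analysis between an embedding matrix
    (n_words, d) and a per-word scalar feature (e.g. log frequency):
    Spearman correlation of the two pairwise-dissimilarity vectors."""
    emb_rdm = pdist(embeddings, metric="cosine")                  # embedding geometry
    feat_rdm = pdist(np.asarray(feature_values).reshape(-1, 1))   # metric geometry
    rho, _ = spearmanr(emb_rdm, feat_rdm)
    return rho

# Tracking rsa_score across training checkpoints shows how quickly the
# vocabulary geometry aligns with each linguistic metric.
```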
The paper addresses the problem of LLMs failing in real-world tool interactions due to incorrect parameterization, poor tool selection, or misinterpretation of user intent by introducing a curriculum-inspired framework that uses structured reasoning templates. This framework guides LLMs through step-by-step instructions for generating function calls, improving their understanding of user goals and tool documentation. Experiments demonstrate a 3-12% relative improvement in tool-use accuracy compared to strong baselines, while also enhancing robustness, interpretability, and transparency.
Introduces a curriculum-inspired framework with structured reasoning templates to guide LLMs in generating function calls, thereby improving tool-use accuracy and interpretability.
This paper investigates the faithfulness of chain-of-thought (CoT) reasoning in LLMs by extracting and manipulating monosemantic features using sparse autoencoders and activation patching. By swapping CoT-reasoning features into noCoT runs on GSM8K, the authors demonstrate a significant increase in answer log-probabilities in Pythia-2.8B, but not in Pythia-70M, indicating a scale-dependent effect. The study also reveals that CoT leads to higher activation sparsity and feature interpretability in the larger model, suggesting more modular internal computation.
Demonstrates that chain-of-thought prompting induces more interpretable and modular internal structures in larger LLMs, as evidenced by feature-level causal interventions.
This paper generalizes the connection between Direct Preference Optimization (DPO) and human choice theory, extending the normative framework underlying DPO. By reworking the standard human choice theory, the authors demonstrate that any compliant machine learning analytical choice model can be embedded within any human choice model. This generalization supports non-convex losses and provides a unifying framework for various DPO extensions like margins and length correction.
Establishes a generalized normative framework connecting DPO with human choice theory, demonstrating broader applicability and theoretical underpinnings for preference optimization.
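For reference, the standard DPO objective that this framework generalizes is the logistic (Bradley-Terry) choice model over a reference policy; the paper's extensions, such as margins and length correction, amount to modifying the term inside the link function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
     \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
     \right)\right]
```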

