Understanding the internal mechanisms of neural networks through circuit analysis, feature visualization, and mechanistic interpretability.
Transformers may succeed at time series forecasting without relying on the complex superposition that drives their power in NLP, challenging the assumption that these models are leveraging rich compositional representations.
Fusing EEG, EMG, and GSR signals lifts driver-behavior decoding accuracy from 73% to 81% and pinpoints the physiological markers that matter most.
Steering neural networks through the intrinsic geometry of their activations unlocks more natural and controllable behaviors than traditional linear interventions.
Forget expensive on-site inspections: this multimodal model uses assessor text and GIS data to accurately predict building energy performance, enabling scalable retrofit planning.
Steering LLMs with conceptors—soft projection matrices capturing the full semantic subspace—yields more robust control and enables Boolean logic for composing concepts, moving beyond the limitations of single-vector steering.
Geometric continuity in deep networks isn't just a byproduct of depth, but an actively sculpted property arising from the interplay of residual connections and symmetry-breaking activations.
Forget retraining: this model learns interpretable logical rules from data in a zero-shot manner by encoding literals with domain-agnostic statistical properties.
Feature importance in machine learning models can be surprisingly unreliable: even when models predict accurately, the features they deem important can vary wildly, especially with small datasets.
Token embedding geometry isn't just abstract math—it directly mirrors how language models internally represent and reason about the world, as shown by its alignment with board state and piece importance in chess.
Symmetric spectral analysis of attention is fundamentally blind to information flow direction, but a simple asymmetry coefficient can restore the signal.
LLMs can construct interpretable, multi-layered models of individual student cognition from journal entries, opening new possibilities for personalized education.
Forget opaque transformers: Gyan offers SOTA language modeling with full interpretability, lower compute, and human-like compositional understanding.
Attention heads hold the key to detecting LLM hallucinations, offering a lightweight, white-box alternative to expensive sampling or external models.
Ditch the black box: This unsupervised semantic projection method rivals supervised models in psychological assessment, offering interpretability and generalizability that supervised methods lack.
Stop reinventing the wheel (or worse, comparing apples to oranges) in XAI evaluation: a standardized "XAI Evaluation Card" could finally bring clarity and rigor to a fragmented field.
Stop squinting at Nsight Compute profiles: KEET uses LLMs to automatically diagnose GPU kernel bottlenecks and suggest optimizations in plain English.
Make your prompts 5x more interpretable without hurting accuracy: IPL combines discrete token selection with continuous optimization, and it's plug-and-play with existing methods.
Activation steering can finally match the nuanced control of prompt engineering: token-specific interventions learned from prompts let you steer LLMs more effectively.
Clinicians trust AI recommendations nearly 3x more when those recommendations are broken down into verifiable facts linked to source guidelines, blowing traditional explainability out of the water.
Forget human-readable models: Agentic-imodels evolves ML models that are optimized for LLM interpretability, boosting agentic data science performance by up to 73%.
Transformers generalize out-of-distribution not by clever interpolation, but by learning a separate, orthogonal representation subspace for unseen tasks.
Releasing differentially private explanations of GNN predictions doesn't hide your graph structure as much as you think: adversaries can reconstruct it with surprising accuracy.
Fixed confidence thresholds are holding back explainable autonomous driving systems, but this new adaptive approach and dataset could unlock better performance and cross-cultural understanding.
Adversarial attacks on speech models leave tell-tale geometric fingerprints in early representation layers that can be detected without transcripts.
Unlocking interpretable clinical forecasting: StructGP recovers causal relationships and patient progression patterns directly from irregular EHR data, outperforming black-box methods in accuracy and uncertainty calibration.
TEA Nets reveal that LLMs express sadness with lower emotional intensity than humans in psychotherapy contexts, highlighting potential limitations in their ability to simulate genuine emotional responses.
Forget complex training schemes – pinpointing and tweaking just 20 neurons can flip an LLM from sycophantic to truthful, thanks to a new "perturbation probing" technique.
Unsupervised knowledge injection via fuzzy logic lets image classifiers reason about concepts they were never explicitly trained on, boosting accuracy and generalization.
LLMs can have their personalities surgically altered by tweaking just 0.5% of their neurons, preserving general capabilities while achieving competitive control.
Forget scaling laws: surgically debiasing reward models by intervening on just 2% of neurons lets smaller models punch *way* above their weight in alignment.
CNN classifiers don't just select from cleaned features; they actively cancel out shared background information via destructive interference, rewriting our understanding of how these networks actually "see".
Time series foundation models (TSFMs) can achieve competitive forecasting performance in critical infrastructure applications while also providing interpretable explanations that align with established domain knowledge.
Sparse autoencoders, despite their popularity for extracting interpretable features, often fail to capture the underlying manifold structure of concepts, instead fragmenting them across multiple, diluted features.
Pinpointing the root cause of transformer failures just got a whole lot easier: DEFault++ can detect, categorize, and diagnose faults with high accuracy, even down to specific mechanisms.
Uncover hidden drivers of disparity: pinpoint the specific combinations of characteristics that explain outcome gaps between populations.
LLMs betray prompt injection attacks with a tell-tale "restlessness" in their activation trajectories, detectable even when individual turns appear harmless.
Claims of human-like cognition in models like CENTAUR crumble under LAPITHS, a framework that reveals these models' performance can be replicated by systems lacking cognitive plausibility.
LLMs stubbornly stick to task-appropriate reasoning even when explicitly instructed to use conflicting logic, but targeted interventions can nudge them towards better instruction following.
Texture, not color, is the secret sauce behind fashion house identity, revealed by probing a multimodal CNN trained on decades of Vogue runway images.
Uncover the hidden drivers behind your KPIs: a new attribution framework finally explains *why* your metrics move, not just *what* changed.
LLMs aren't just memorizing words; they're organizing them in a feature space that mirrors the nuanced semantic relationships humans perceive.
LLMs' factual recall falters when fine-tuned on new information, and this can be traced to specific latent directions in the residual stream.
Quantum computing can surface critical network attack patterns that classical methods miss, achieving up to 99.6% test precision on unique subgroups.
Quantum annealing offers a surprisingly effective route to interpretable AI, outperforming standard gradient-based methods in disentangling CNN decision boundaries.
GNNs tagging jets at the LHC aren't black boxes: explainability methods reveal they learn physically meaningful features of QCD, with performance varying predictably across energy regimes.
Rule extraction from tree ensembles just got 22x faster, without sacrificing accuracy or interpretability.
Forget what you thought you knew about how models learn: analyzing loss gradients, not just parameter updates, reveals a hidden order-of-magnitude increase in the coupling between learned features and parameter space.
LLMs process emotions in three distinct phases, but some emotions like Disgust are represented far more weakly and diffusely than others.
Feature decorrelation during training not only sharpens saliency maps, but also *improves* model accuracy, challenging the conventional wisdom that interpretability comes at the cost of performance.
Concept extraction's identifiability problem just got a lot easier, thanks to a new framework that turns guarantee proofs into set intersection problems.