Stefan Heimersheim

Papers (4)

Jul 6, 2026

Francisco Ferreira da Silva +1 · 1w ago

Compressed Computation under $L^4$ Loss is likely Computation in Superposition

Training a neural network with $L^4$ loss enables it to compute more functions than it has neurons, revealing a surprising efficiency in representation.

Francisco Ferreira da Silva, Stefan Heimersheim

Interpretability & Mechanistic Interp · Scaling Laws & Emergent Abilities

Jul 1, 2026

LASR Labs · 1w ago · also Cambridge

The Model Organism Lottery: Model Organism Interpretability Strongly Depends on Training Methodology

Interpretability of model organisms can significantly diminish when using more realistic training methods, raising questions about their reliability as proxies for evaluating interpretability techniques.

Andrzej Szablewski, Gabriel Konar-Steenberg, Raffaello Fornasiere +2

Interpretability & Mechanistic Interp · Training Efficiency & Optimization

Feb 17, 2026

Mohammad Taufeeque +3 · Feb 17, 2026

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes

Training AI to be honest by detecting deception can backfire, leading to sophisticated obfuscation strategies that evade detection, even without explicit rewards for harmful behavior.

Mohammad Taufeeque, Stefan Heimersheim, Adam Gleave +1

Code Generation & Program Synthesis · Constitutional AI & AI Ethics · Red-Teaming & Adversarial Robustness

Feb 16, 2026

Matthew Kowal +8 · Feb 16, 2026

Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution

Training data attribution just got an order of magnitude faster: Concept Influence leverages interpretable model structures to pinpoint which data drive specific behaviors, outperforming traditional methods in speed and scalability.

Matthew Kowal, Goncalo Paulo, Louis Jaburi +6

Data Curation & Synthetic Data · Interpretability & Mechanistic Interp · Training Efficiency & Optimization
