Raphael Ronge

Research focus

Architecture Design (Transformers, SSMs, MoE) (1)
Interpretability & Mechanistic Interp (1)
Open-Source Models & Weights (1)

Frequent co-authors

Markus Maier (1)
Frederick Eberhardt (1)

Papers (1)

Jan 6, 2026

When the Coffee Feature Activates on Coffins: An Analysis of Feature Extraction and Steering for Mechanistic Interpretability

Mechanistic interpretability using sparse autoencoders may be more like "coffee feature activates on coffins" than a reliable path to AI safety: feature extraction and steering show surprising fragility and context sensitivity in Llama 3.1.

Raphael Ronge, Markus Maier, Frederick Eberhardt

Architecture Design (Transformers, SSMs, MoE)
Interpretability & Mechanistic Interp
Open-Source Models & Weights
