claim
active
claim:our-saes-features-are-more-interpretable-than-neurons
Our SAEs' features are more interpretable than neurons.
Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
Source paper
extracted_from
Neighborhood (ranked by edge count)
Findings (1)
finding
- Quantitative comparison supporting SAE utility.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
- Extension of mechanistic interpretability findings to the metacognitive domain.
- Claim that feature grounding enables interpretability metrics.
- SAE features are not simply mirroring individual neurons.
- Novel finding that agentic self-evaluation of emotionality correlates with feature persistence.
- Claim from footnote 3, acknowledging neuron-level interpretability while arguing subcomponents are better.
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights.
- Surprising finding that the two evaluation methods diverge in their relationship with persistence.