claim
active
claim:larger-saes-contain-features-for-concepts-not-captured-in-smaller-saes-indicating-improved-coverage
Larger SAEs contain features for concepts not captured in smaller SAEs, indicating improved coverage.
Scaling SAE size increases granularity and discovers new features.
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Observed across SAE scales, e.g., 'San Francisco' split into 11 features.
- Extension of mechanistic interpretability findings to the metacognitive domain
- Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
- Claim that feature grounding enables interpretability metrics.
- Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
- Clamping feature activations causally alters model behavior in interpretable ways.
- Novel finding that agentic self-evaluation of emotionality correlates with feature persistence
- A promising property for interpretability analysis off-distribution.
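
The "related by similarity" filter described above (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the function names, the toy embeddings, and the representation of typed edges as unordered pairs are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs at or above the threshold that share no typed edge with target.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical data)
    typed_edges: set of frozenset({a, b}) pairs that already have a typed relation
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue  # skip the entity itself
        if frozenset((target, name)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Most similar first
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data, purely illustrative
embeddings = {
    "claim": [1.0, 0.0],
    "a": [0.9, 0.1],   # close to the claim, no typed edge
    "b": [0.0, 1.0],   # orthogonal, below threshold
    "c": [1.0, 0.05],  # close, but already has a typed edge
}
typed_edges = {frozenset(("claim", "c"))}
neighbors = related_by_similarity("claim", embeddings, typed_edges)
```

With this toy data only "a" is returned: "b" falls below the threshold and "c" is excluded because it already has a typed edge to the claim.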