claim

active

claim:larger-saes-contain-features-for-concepts-not-captured-in-smaller-saes-indicating-improved-coverage

Larger SAEs contain features for concepts not captured in smaller SAEs, indicating improved coverage.

Scaling SAE size increases granularity and discovers new features.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Feature splitting occurs: smaller SAE features split into multiple finer-grained features in larger SAEs. · claim · 0.809
Observed across SAE scales, e.g., 'San Francisco' split into 11 features.
SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation. · claim · 0.797
Extension of mechanistic interpretability findings to the metacognitive domain.
SAE features tend to shatter manifolds into many small and apparently-unrelated pieces, obscuring the overarching semantic structure. · claim · 0.784
Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.781
Claim that feature grounding enables interpretability metrics.
Our SAEs' features are more interpretable than neurons. · claim · 0.775
Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
Features can be used to steer large models. · claim · 0.767
Clamping feature activations causally alters model behavior in interpretable ways.
SAE features that the model self-describes as more emotional tend to be more persistent than variance-matched SAE features. · claim · 0.757
Novel finding that agentic self-evaluation of emotionality correlates with feature persistence.
SAE features generalize to images despite training only on text, indicating out-of-distribution robustness. · claim · 0.757
A promising property for interpretability analysis off-distribution.
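
The "related by similarity" rule (cosine ≥ 0.65, no typed edge) could be computed roughly as in the sketch below. This is an illustration only: the function names, the two-dimensional toy embeddings, and the `typed_edges` set are all hypothetical, and the page does not say how the embeddings or the edge store are actually built.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target by a typed relation.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed for illustration; not taken from the page).
embeddings = {
    "coverage": [1.0, 0.0],
    "feature-splitting": [0.8, 0.6],
    "steering": [0.0, 1.0],
    "source-paper": [1.0, 0.1],
}
typed_edges = {("coverage", "source-paper")}  # e.g. an extracted_from edge

neighbors = related_by_similarity("coverage", embeddings, typed_edges)
print(neighbors)
```

In this toy example "source-paper" is excluded despite its high similarity because it already has a typed edge, which mirrors the distinction the section draws between typed relations and untyped similarity neighbors.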