concept
active
concept:sae-features
SAE features
The individual, supposedly monosemantic directions learned by sparse autoencoders (SAEs); argued here to fragment manifolds into disconnected pieces.
Neighborhood — ranked by edge-count
Claims (2)
claim
- Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
- Positive claim that geometric descriptions retain the conceptual coherence lost in atomized feature decompositions.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Method of adding scaled versions of sparse autoencoder latent features during generation to causally modulate model behavior
- Method where Kimi evaluates steered vs unsteered text samples from another instance to rate SAE feature emotionality (0-100)
- Interpretable features extracted by sparse autoencoders used as steering targets in this study
- The specific SAE architecture trained: 100K+ possible features compressed to 64 active per token for layer-40 activations
- Out-of-distribution generalization of SAE features.
- Extension of mechanistic interpretability findings to the metacognitive domain
- Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
- Claim that feature grounding enables interpretability metrics.
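
Two of the related entries above describe mechanics that are easier to picture in code: an SAE that keeps only a fixed number of features active per token, and steering by adding a scaled feature during generation. The following is a minimal NumPy sketch, not the implementation used in any of the linked studies. It assumes the sparsity is enforced by a top-k selection and that steering adds the feature's decoder direction to the activations; the function names `topk_encode` and `steer`, the weight names `W_enc`, `b_enc`, `W_dec`, and the toy sizes (16-dimensional activations, 128 features, 4 active) are all hypothetical stand-ins for the 100K+ features and 64 active per token mentioned above.

```python
import numpy as np

def topk_encode(x, W_enc, b_enc, k):
    """Encode activations, keeping only the k largest pre-activations per token.

    Assumption: sparsity is enforced by top-k selection followed by a ReLU.
    """
    pre = x @ W_enc + b_enc                                # (n_tokens, n_features)
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]      # indices of the k largest per row
    vals = np.take_along_axis(pre, idx, axis=-1)
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.maximum(vals, 0.0), axis=-1)
    return z

def steer(x, W_dec, feature, scale):
    """Add a scaled copy of one feature's decoder direction to every token.

    Assumption: each row of W_dec is the direction a feature writes into
    the activation space.
    """
    return x + scale * W_dec[feature]

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 128, 4
x = rng.normal(size=(3, d_model))                          # 3 tokens of activations
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))

z = topk_encode(x, W_enc, b_enc, k)                        # at most k nonzeros per token
x_steered = steer(x, W_dec, feature=7, scale=2.5)          # feature index is arbitrary
```

Each row of `z` has at most `k` nonzero entries, which is the sense in which a large dictionary is "compressed" to a few active features per token. The critique recorded on this card concerns what that per-feature view loses, not how the encoding is computed.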