claim
active
claim:sae-features-tend-to-shatter-manifolds-into-many-small-and-apparently-unrelated-pieces-obscuring-the-overarching-semantic-structure
SAE features tend to shatter manifolds into many small and apparently-unrelated pieces, obscuring the overarching semantic structure.
Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
Source paper
extracted_from · (2026) · Geiger, Atticus · Lubana, Ekdeep Singh · Fel, Thomas · Merullo, Jack +3
Neighborhood — ranked by edge-count
Communities (2)
community
- Explores geometry of activation/behavior manifolds to enable selective, non-destructive concept interventions.
- Evaluating sparse autoencoder monosemanticity and entanglement using clinical taxonomy grounding across EEG/sleep foundation models.
Concepts (4)
concept
- manifold · cites · A smooth, potentially curved surface in activation space along which activations vary according to a coherent semantic dimension.
- SAE features · cites · The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
- semantic structure · cites · The meaningful organization of concepts in a model's representation space, claimed to be better captured by manifolds than by SAEs.
- shattering · cites · The phenomenon where SAEs break a smooth geometric manifold into many small, seemingly unrelated pieces, losing overarching structure.
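The shattering concept above can be illustrated with a toy sketch. This is not the source paper's method or a trained SAE: it assumes a unit circle as a stand-in manifold, a hypothetical dictionary of K evenly spaced directions, and a top-1 ReLU encoding as a crude substitute for a sparsity penalty. Under those assumptions, each feature ends up firing on one short arc, so the activation pattern alone reads as K separate features rather than one circular variable.

```python
import math

def unit(angle):
    """Point on the unit circle at the given angle (radians)."""
    return (math.cos(angle), math.sin(angle))

def top1_code(point, dictionary):
    """Index of the dictionary direction with the largest ReLU activation.

    Assumption: top-1 selection stands in for SAE sparsity; a real SAE
    learns its dictionary and may activate several features per input.
    """
    acts = [max(0.0, point[0] * d[0] + point[1] * d[1]) for d in dictionary]
    return max(range(len(acts)), key=acts.__getitem__)

K = 8    # number of hypothetical features
N = 360  # number of sample points along the manifold

dictionary = [unit(2 * math.pi * k / K) for k in range(K)]
points = [unit(2 * math.pi * i / N) for i in range(N)]

# Each point on the circle is assigned to exactly one feature.
codes = [top1_code(p, dictionary) for p in points]

# Number of sample points on which each feature fires.
counts = [codes.count(k) for k in range(K)]
```

In this toy setting every feature covers an equal arc of the circle. The ordering of those arcs is visible only in the geometry of the dictionary directions, not in the per-feature activations, which is one way to picture the loss of overarching structure that the claim describes.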
Vectors (1)
vector
- Interpretability as Microscope for Consciousness · addresses_vector
Methods (1)
method
- Sparse Autoencoders (SAE) · contradicts · Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
Source docs (1)
source_doc
- 2026-05-14_phil-trans-A-goodfire-aboutblank-impact.md · extracted_from
Claims (1)
claim
- Positive claim that geometric descriptions retain the conceptual coherence lost in atomized feature decompositions.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. These are candidates for new edges or unrecognized duplicates.
- Observed across SAE scales, e.g., 'San Francisco' split into 11 features.
- Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
- Claim that feature grounding enables interpretability metrics.
- Surprising finding that the two evaluation methods diverge in their relationship with persistence.
- Extension of mechanistic interpretability findings to the metacognitive domain.
- A promising property for interpretability analysis off-distribution.
- Novel finding that agentic self-evaluation of emotionality correlates with feature persistence.
- Scaling SAE size increases granularity and discovers new features.
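The selection rule for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the function names, the entity IDs, the two-dimensional embeddings, and the representation of typed edges as a set of unordered pairs are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)

def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Entities with cosine >= threshold to the target and no typed edge to it.

    embeddings:  dict mapping entity ID to vector (hypothetical storage)
    typed_edges: set of frozenset({id_a, id_b}) pairs (hypothetical storage)
    Returns (entity_id, score) pairs, most similar first.
    """
    target = embeddings[target_id]
    matches = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vector)
        if score >= threshold:
            matches.append((entity_id, score))
    return sorted(matches, key=lambda pair: pair[1], reverse=True)

# Toy data: made-up 2-D embeddings purely for illustration.
embeddings = {
    "claim_a": [1.0, 0.0],
    "claim_b": [0.8, 0.6],    # similar to claim_a, no typed edge
    "claim_c": [0.0, 1.0],    # orthogonal to claim_a
    "concept_x": [1.0, 0.1],  # similar, but already has a typed edge
}
typed_edges = {frozenset(("claim_a", "concept_x"))}

result = related_by_similarity("claim_a", embeddings, typed_edges)
```

With this toy data only `claim_b` is returned: `claim_c` falls below the threshold and `concept_x` is excluded because it already has a typed edge to the target.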