claim
active
claim:manifold-level-descriptions-recover-overarching-semantic-structure-that-sae-features-miss
Manifold-level descriptions recover overarching semantic structure that SAE features miss.
Positive claim that geometric descriptions retain the conceptual coherence lost in atomized feature decompositions.
Source paper
extracted_from(2026) · Geiger, Atticus · Lubana, Ekdeep Singh · Fel, Thomas · Merullo, Jack +3
Neighborhood (ranked by edge count)
Concepts (3)
- SAE features · cites: The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
- semantic structure · cites: The meaningful organization of concepts in a model's representation space, claimed to be better captured by manifolds than by SAEs.
- An interpretability approach that describes representations in terms of entire curved manifolds rather than many small features.
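The contrast between atomized SAE features and a manifold-level description can be illustrated with a toy sketch. Everything here is a hypothetical assumption, not the source paper's method: seven concepts (for example, days of the week) are placed on a circle in 2-D, a hard top-1 sparse code over one dictionary atom per concept stands in for SAE features (it is not a trained SAE), and a single angular coordinate stands in for the manifold-level description.

```python
import math

# Toy assumption: 7 cyclic concepts (e.g. days of the week) lying on a unit circle in 2-D.
N = 7
points = [(math.cos(2 * math.pi * k / N), math.sin(2 * math.pi * k / N)) for k in range(N)]

# Hypothetical "SAE-style" dictionary: one unit-norm atom per concept.
atoms = list(points)

def sparse_code(x, atoms):
    """Top-1 sparse code: keep only the activation of the best-matching atom."""
    scores = [x[0] * a[0] + x[1] * a[1] for a in atoms]
    best = max(range(len(atoms)), key=lambda i: scores[i])
    code = [0.0] * len(atoms)
    code[best] = scores[best]
    return code

def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

codes = [sparse_code(p, atoms) for p in points]

# In code space every pair of distinct concepts is equally far apart,
# so cyclic adjacency (Monday next to Tuesday) is no longer visible.
code_dists = {round(dist(codes[i], codes[j]), 6) for i in range(N) for j in range(N) if i != j}

def angle(x):
    """Manifold-level description: one circular coordinate in [0, 2*pi)."""
    return math.atan2(x[1], x[0]) % (2 * math.pi)

# Sorting by the circular coordinate recovers the cyclic order of the concepts.
order = sorted(range(N), key=lambda i: angle(points[i]))

print(code_dists)  # {1.414214}
print(order)       # [0, 1, 2, 3, 4, 5, 6]
```

In this toy, the one-hot sparse codes make every pair of concepts equidistant (about 1.414, i.e. the square root of 2), while the single angle recovers their order. Real SAEs are trained with reconstruction and sparsity objectives and need not fragment a manifold this cleanly.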
Claims (1)
- Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Claim that feature grounding enables interpretability metrics.
- Generalization finding from the full paper extending beyond days-of-week to other structured concepts.
- Extension of mechanistic interpretability findings to the metacognitive domain.
- Surprising finding that the two evaluation methods diverge in their relationship with persistence.
- Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
- Novel finding that agentic self-evaluation of emotionality correlates with feature persistence.
- Features may not be strictly one-dimensional objects; higher-dimensional feature manifolds may exist in model representations · hypothesis · 0.749: Extension of the superposition hypothesis to account for continuous families of features.
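A similarity list like the one above could be produced roughly as follows. This is a minimal sketch, assuming each entity has an embedding vector and that the list keeps entities with cosine similarity ≥ 0.65 to this one while excluding those already linked by a typed edge, sorted by score. The function `similar_untyped`, the toy two-dimensional embeddings, and the entity names are hypothetical; the actual embedding model and ranking are not described on this page.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similar_untyped(target, embeddings, typed_neighbors, threshold=0.65):
    """Hypothetical helper: (name, score) pairs with cosine >= threshold and
    no typed edge to target, highest score first."""
    t = embeddings[target]
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue  # skip self and entities already connected by a typed edge
        score = cosine(t, vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Toy embeddings (hypothetical values, for illustration only).
embeddings = {
    "this_claim": [1.0, 0.0],
    "sae_features": [0.9, 0.1],                  # similar, but has a typed "cites" edge
    "feature_manifolds_hypothesis": [0.8, 0.5],  # similar, no typed edge
    "unrelated": [0.0, 1.0],                     # below the threshold
}
result = similar_untyped("this_claim", embeddings, typed_neighbors={"sae_features"})
print([name for name, _ in result])  # ['feature_manifolds_hypothesis']
```

Excluding typed neighbors before thresholding is what makes the list useful as a source of candidate new edges or unrecognized duplicates, rather than a restatement of relations the graph already records.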