claim

active

claim:sae-features-tend-to-shatter-manifolds-into-many-small-and-apparently-unrelated-pieces-obscuring-the-overarching-semantic-structure

SAE features tend to shatter manifolds into many small and apparently-unrelated pieces, obscuring the overarching semantic structure.

Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.

Source paper

extracted_from

The World Inside Neural Networks

(2026) · Geiger, Atticus · Lubana, Ekdeep Singh · Fel, Thomas · Merullo, Jack +3

Neighborhood — ranked by edge-count

Communities (2)

community

Manifold-aware concept steering in neural representations
members_of
Explores geometry of activation/behavior manifolds to enable selective, non-destructive concept interventions.
SAE Feature Geometry in Biomedical Signals
members_of
Evaluating sparse autoencoder monosemanticity and entanglement using clinical taxonomy grounding across EEG/sleep foundation models.

Concepts (4)

concept

manifold
cites
A smooth, potentially curved surface in activation space along which activations vary according to a coherent semantic dimension.
SAE features
cites
The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
semantic structure
cites
The meaningful organization of concepts in a model's representation space, claimed to be better captured by manifolds than by SAEs.
shattering
cites
The phenomenon where SAEs break a smooth geometric manifold into many small, seemingly unrelated pieces, losing overarching structure.
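As a toy illustration of the shattering concept above (not the source paper's method or experiment), the sketch below places points on a unit circle, a stand-in for a one-dimensional manifold, and assigns each point to the single closest direction in a small fixed dictionary. The top-1 assignment rule, the evenly spaced dictionary, and all names (`top1_feature`, `shatter_circle`) are assumptions made for illustration; a real SAE learns its dictionary and uses a trained encoder.

```python
import math
from collections import Counter


def top1_feature(point, dictionary):
    """Return the index of the dictionary direction with the largest dot product."""
    scores = [point[0] * d[0] + point[1] * d[1] for d in dictionary]
    return max(range(len(dictionary)), key=scores.__getitem__)


def shatter_circle(n_points=360, n_features=8):
    """Assign each point on a unit circle to its single best-matching direction.

    Hypothetical toy setup: the circle plays the role of a smooth manifold,
    and the evenly spaced unit vectors play the role of SAE feature directions.
    """
    points = [
        (math.cos(2 * math.pi * i / n_points), math.sin(2 * math.pi * i / n_points))
        for i in range(n_points)
    ]
    dictionary = [
        (math.cos(2 * math.pi * k / n_features), math.sin(2 * math.pi * k / n_features))
        for k in range(n_features)
    ]
    return [top1_feature(p, dictionary) for p in points]


# One continuous angle variable is reported as a set of discrete feature indices.
assignments = shatter_circle()
arc_sizes = Counter(assignments)
print(sorted(arc_sizes.items()))
```

With these defaults, a single continuous angle ends up described by eight separate feature indices, each covering one contiguous arc. Nothing in the indices alone indicates that the arcs are neighbors on the same circle, which is the loss of overarching structure the concept describes.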

Vectors (1)

vector

Interpretability as Microscope for Consciousness
addresses_vector

Methods (1)

method

Sparse Autoencoders (SAE)
contradicts
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.

Source docs (1)

source_doc

2026-05-14_phil-trans-A-goodfire-aboutblank-impact.md
extracted_from

Claims (1)

claim

Manifold-level descriptions recover overarching semantic structure that SAE features miss.
extends
Positive claim that geometric descriptions retain the conceptual coherence lost in atomized feature decompositions.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Feature splitting occurs: features in smaller SAEs split into multiple finer-grained features in larger SAEs. · claim · 0.848
Observed across SAE scales, e.g., 'San Francisco' split into 11 features.
Our SAEs' features are more interpretable than neurons. · claim · 0.819
Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.806
Claim that feature grounding enables interpretability metrics.
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.805
Surprising finding that the two evaluation methods diverge in their relationship with persistence.
SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation. · claim · 0.800
Extension of mechanistic interpretability findings to the metacognitive domain.
SAE features generalize to images despite training only on text, indicating out-of-distribution robustness. · claim · 0.793
A promising property for interpretability analysis off-distribution.
SAE features that the model self-describes as more emotional tend to be more persistent than variance-matched SAE features. · claim · 0.790
Novel finding that agentic self-evaluation of emotionality correlates with feature persistence.
Larger SAEs contain features for concepts not captured in smaller SAEs, indicating improved coverage. · claim · 0.784
Scaling SAE size increases granularity and discovers new features.
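
The selection rule behind the "Related by similarity" listing (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration under stated assumptions, not the page's actual implementation: the function names, the dictionary of embeddings, and the set-of-pairs representation of typed edges are all hypothetical, and the toy vectors are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(query_id, embeddings, typed_edges, threshold=0.65):
    """List entities at or above the threshold that share no typed edge with the query.

    embeddings: hypothetical mapping of entity id -> vector.
    typed_edges: hypothetical set of (source_id, target_id) pairs.
    """
    query = embeddings[query_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == query_id:
            continue
        # Skip entities already linked to the query in either direction.
        if (query_id, other_id) in typed_edges or (other_id, query_id) in typed_edges:
            continue
        score = cosine(query, vec)
        if score >= threshold:
            hits.append((other_id, score))
    # Highest similarity first, matching the ordering of the listing.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up toy data: "c" is similar but already has a typed edge; "b" is dissimilar.
toy_embeddings = {"q": [1.0, 0.0], "a": [1.0, 0.1], "b": [0.0, 1.0], "c": [1.0, 0.2]}
toy_edges = {("q", "c")}
candidates = related_by_similarity("q", toy_embeddings, toy_edges)
print(candidates)
```

Entities returned this way are only candidates: a high cosine score suggests a possible missing edge or duplicate, but it does not by itself establish a typed relation.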