claim

active

claim:manifold-level-descriptions-recover-overarching-semantic-structure-that-sae-features-miss

Manifold-level descriptions recover overarching semantic structure that SAE features miss.

Positive claim that geometric descriptions retain the conceptual coherence lost in atomized feature decompositions.

Source paper

extracted_from

The World Inside Neural Networks

(2026) · Geiger, Atticus · Lubana, Ekdeep Singh · Fel, Thomas · Merullo, Jack +3

Neighborhood — ranked by edge-count

Concepts (3)

concept

SAE features · cites
The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.

semantic structure · cites
The meaningful organization of concepts in a model's representation space, claimed to be better captured by manifolds than by SAEs.

manifold-level description · cites
An interpretability approach that describes representations in terms of entire curved manifolds rather than many small features.

Claims (1)

claim

SAE features tend to shatter manifolds into many small and apparently-unrelated pieces, obscuring the overarching semantic structure. · extends
Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
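
The filter described above (cosine similarity at or above 0.65, with no typed edge to this entity) can be sketched as follows. This is a minimal illustration under assumptions, not the system's actual implementation: the function names `cosine` and `similar_untyped`, the toy two-dimensional embeddings, and the entity names are all hypothetical. Only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs that clear the threshold and have
    no typed edge to the target, sorted by descending score."""
    hits = []
    for name, vec in embeddings.items():
        # Skip the target itself and anything already linked by a typed edge.
        if name == target or name in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Hypothetical toy data: four entities with 2-D embeddings.
embeddings = {
    "claim": [1.0, 0.0],
    "linked": [1.0, 0.1],  # very similar, but already has a typed edge
    "near": [0.8, 0.6],    # cosine 0.8 with "claim", no typed edge
    "far": [0.0, 1.0],     # cosine 0.0 with "claim"
}
typed_edges = {"linked"}

result = similar_untyped("claim", embeddings, typed_edges)
print(result)
```

In this toy setup only `near` is returned: `linked` is excluded because it already has a typed edge, and `far` falls below the threshold.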

SAE-based mechanistic interpretability will be superseded by manifold-based analysis for understanding semantic concepts within 24 months. · prediction · 0.806

SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.790
Claim that feature grounding enables interpretability metrics.

Analogous alignment between representation manifold and behavior manifold is found across months, letters, ages, and synthetic in-context learning tasks in language models. · finding · 0.780
Generalization finding from the full paper extending beyond days-of-week to other structured concepts.

SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation. · claim · 0.778
Extension of mechanistic interpretability findings to the metacognitive domain.

Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.760
Surprising finding that the two evaluation methods diverge in their relationship with persistence.

Our SAEs' features are more interpretable than neurons. · claim · 0.752
Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.

SAE features that the model self-describes as more emotional tend to be more persistent than variance-matched SAE features. · claim · 0.749
Novel finding that agentic self-evaluation of emotionality correlates with feature persistence.

Features may not be strictly one-dimensional objects; higher-dimensional feature manifolds may exist in model representations. · hypothesis · 0.749
Extension of superposition hypothesis to account for continuous families of features.