SAE features

The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.

Neighborhood — ranked by edge-count

claim

SAE features tend to shatter manifolds into many small and apparently-unrelated pieces, obscuring the overarching semantic structure.
cites
Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
Manifold-level descriptions recover overarching semantic structure that SAE features miss.
cites
Positive claim that geometric descriptions retain the conceptual coherence lost in atomized feature decompositions.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAE Feature Steering · framework · 0.855
Method of adding scaled versions of sparse autoencoder latent features during generation to causally modulate model behavior
Textual SAE feature emotionality evaluation · method · 0.781
Method where Kimi evaluates steered vs unsteered text samples from another instance to rate SAE feature emotionality (0-100)
SAE Latents · concept · 0.776
Interpretable features extracted by sparse autoencoders used as steering targets in this study
SAE sparse features (100K+ features, 64 active per token) · concept · 0.769
The specific SAE architecture trained: 100K+ possible features compressed to 64 active per token for layer-40 activations
SAE features trained on text activations generalize to image inputs, activating on relevant visual depictions. · finding · 0.769
Out-of-distribution generalization of SAE features.
SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation · claim · 0.765
Extension of mechanistic interpretability findings to the metacognitive domain
Sparse Autoencoders (SAE) · method · 0.765
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.759
Claim that feature grounding enables interpretability metrics.
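
The neighbor list above is selected by a "cosine ≥ 0.65 · no typed edge" rule. As a rough illustration of what such a filter might look like, here is a minimal sketch. It assumes each entity has an embedding vector and that typed relations are available as a set of entity names; the function name `untyped_neighbors` and all of its parameters are hypothetical and are not taken from the system that produced this page.

```python
import numpy as np


def untyped_neighbors(query, embeddings, names, typed, threshold=0.65):
    """Rank entities by cosine similarity to `query`.

    Keeps only entities whose similarity is at or above `threshold`
    and that have no typed edge to the query entity (assumed to be
    given as the set `typed` of entity names).
    """
    query = np.asarray(query, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)

    # Normalize so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q

    hits = [
        (name, float(sim))
        for name, sim in zip(names, sims)
        if sim >= threshold and name not in typed
    ]
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities returned by a filter like this would be the candidates for new edges or unrecognized duplicates; the 0.65 default simply mirrors the threshold shown in the heading.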