claim
active
claim:feature-splitting-means-dictionaries-with-fewer-features-provide-coarser-summaries-of-model-features-while-larger-dictionaries-reveal-finer-grained-distinctions-with-no-uniquely-correct-number-of-features
Feature splitting means dictionaries with fewer features provide coarser summaries of model features while larger dictionaries reveal finer-grained distinctions, with no uniquely 'correct' number of features
Authors argue the absence of a fixed feature count is a property of the superposition geometry, not a failure of the method
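As a hedged illustration of the claim, the sketch below groups the directions of a larger dictionary under their nearest direction in a smaller dictionary, using cosine similarity. Everything in it is an assumption made for illustration: the helper name `split_children`, the `min_cosine` threshold, and the toy vectors are hypothetical and are not taken from the source paper or from any trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def split_children(small_dict, large_dict, min_cosine=0.7):
    """Hypothetical helper: assign each large-dictionary direction to its
    nearest small-dictionary direction, if the similarity clears min_cosine."""
    children = {i: [] for i in range(len(small_dict))}
    for j, fine in enumerate(large_dict):
        sims = [cosine(coarse, fine) for coarse in small_dict]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= min_cosine:
            children[best].append(j)
    return children


# Toy decoder directions (made-up numbers, for illustration only).
small_dict = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
large_dict = [
    (0.9, 0.1, 0.3),
    (0.9, -0.1, -0.3),
    (1.0, 0.2, 0.0),
    (0.1, 1.0, 0.2),
    (0.0, 0.9, -0.3),
]

children = split_children(small_dict, large_dict)
for coarse_index, fine_indices in children.items():
    print(f"coarse feature {coarse_index} -> fine features {fine_indices}")
```

In this toy setup the two coarse directions each summarize a cluster of finer directions. Adding more directions to `large_dict` would subdivide the clusters further, which is the sense in which no single dictionary size is singled out as correct.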
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Findings (1)
finding
- Concrete example of feature splitting revealing unexpected model structure
Questions (1)
question
- What is the 'correct number of features' for dictionary learning, and is this question well-posed? · gates · Open question about whether there is a true discrete feature count or a continuous splitting process
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Observed across SAE scales, e.g., 'San Francisco' split into 11 features.
- Phenomenon where a feature in a small SAE splits into multiple finer features in a larger SAE.
- Motivation for using sparsity-based dictionary learning on language models
- Feature presence depends on concept frequency in training data, with a threshold scaling inversely with alive features.
- Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023) · concept · 0.740 · Foundational SAE mechanistic interpretability paper
- Bigger models are more likely to converge to a shared representation than smaller models · hypothesis · 0.739 · Selective pressure toward convergence via model capacity
- Authors take agnostic position on ontological status but universality evidence pushes toward features being real