claim

active

claim:feature-splitting-means-dictionaries-with-fewer-features-provide-coarser-summaries-of-model-features-while-larger-dictionaries-reveal-finer-grained-distinctions-with-no-uniquely-correct-number-of-features

Feature splitting means dictionaries with fewer features provide coarser summaries of model features while larger dictionaries reveal finer-grained distinctions, with no uniquely 'correct' number of features

Authors argue the absence of a fixed feature count is a property of the superposition geometry, not a failure of the method

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Findings (1)

finding

Single base64 feature A/0/45 splits into three distinct features in A/1: letter-specific, digit-specific, and ASCII-encoding-specific
supports
Concrete example of feature splitting revealing unexpected model structure
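
Illustrative sketch: one way to picture the split in this finding is to assign each feature of a larger dictionary to its nearest feature in a smaller dictionary and see which small-dictionary features collect several children. This is not the source paper's procedure. The direction vectors, the feature names, the `min_cosine` threshold, and the helper `group_by_nearest_parent` are all hypothetical, chosen only to mirror the three-way base64 split.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_by_nearest_parent(small_dict, large_dict, min_cosine=0.5):
    """Assign each large-dictionary feature to its most similar
    small-dictionary feature (hypothetical matching rule)."""
    children = defaultdict(list)
    for child_name, child_vec in large_dict.items():
        parent_name, sim = max(
            ((name, cosine(child_vec, vec)) for name, vec in small_dict.items()),
            key=lambda pair: pair[1],
        )
        # Keep only matches that are reasonably well aligned.
        if sim >= min_cosine:
            children[parent_name].append(child_name)
    return dict(children)


# Toy directions (made up for illustration, not taken from any real dictionary).
small = {"base64": [1.0, 0.0, 0.0], "other": [0.0, 0.0, 1.0]}
large = {
    "base64_letters": [0.9, 0.3, 0.0],
    "base64_digits": [0.9, -0.3, 0.0],
    "base64_ascii": [0.8, 0.0, 0.3],
    "other": [0.0, 0.1, 1.0],
}

splits = group_by_nearest_parent(small, large)
for parent, kids in splits.items():
    print(parent, "->", kids)
```

With these toy inputs the single coarse `base64` direction collects three finer children, while `other` keeps one. Neither dictionary is the "correct" one: they describe the same directions at different resolutions.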

Questions (1)

question

What is the 'correct number of features' for dictionary learning, and is this question well-posed?
gates
Open question about whether there is a true discrete feature count or a continuous splitting process

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Feature splitting occurs: smaller SAE features split into multiple finer-grained features in larger SAEs. · claim · 0.831
Observed across SAE scales, e.g., 'San Francisco' split into 11 features.
Feature splitting · concept · 0.818
Phenomenon where a feature in a small SAE splits into multiple finer features in a larger SAE.
The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.765
Motivation for using sparsity-based dictionary learning on language models
There appears to be a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them. · claim · 0.759
Feature presence depends on concept frequency in training data, with a threshold scaling inversely with alive features.
Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning · claim · 0.750
Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023) · concept · 0.740
Foundational SAE mechanistic interpretability paper
Bigger models are more likely to converge to a shared representation than smaller models · hypothesis · 0.739
Selective pressure toward convergence via model capacity
Feature universality across independently trained models suggests features have some existence beyond individual models · claim · 0.737
Authors take agnostic position on ontological status but universality evidence pushes toward features being real