claim

active

claim:feature-splitting-occurs-smaller-sae-features-split-into-multiple-finer-grained-features-in-larger-saes

Feature splitting occurs: smaller SAE features split into multiple finer-grained features in larger SAEs.

Observed across SAE scales: for example, a 'San Francisco' feature in the 1M SAE splits into 11 finer-grained features in the 34M SAE.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Findings (1)

finding

A 'San Francisco' feature in the 1M SAE splits into 11 fine-grained features in the 34M SAE.
supports
Empirical observation of feature splitting.
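
As a rough illustration of how splitting like this might be looked for, the sketch below groups the features of a larger dictionary under the smaller-dictionary feature whose decoder direction they most resemble. Comparing decoder directions by cosine similarity is an assumption made here for illustration, not a procedure stated on this page; the function names, the 0.7 threshold, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def split_candidates(small_decoder, large_decoder, threshold=0.7):
    """Group large-SAE feature indices under their nearest small-SAE feature.

    Hypothetical criterion: a large-SAE feature counts as a split of a
    small-SAE feature when their decoder directions have cosine similarity
    of at least `threshold` (an arbitrary illustrative value).
    """
    groups = {i: [] for i in range(len(small_decoder))}
    for j, w_large in enumerate(large_decoder):
        # Find the small-SAE feature whose direction is closest to this one.
        sims = [cosine(w_small, w_large) for w_small in small_decoder]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= threshold:
            groups[best].append(j)
    return groups


# Toy decoder directions, made up for illustration only.
small = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
large = [
    [0.9, 0.1, 0.2],    # near small feature 0
    [0.9, -0.1, -0.2],  # near small feature 0
    [0.8, 0.0, 0.5],    # near small feature 0
    [0.1, 1.0, 0.0],    # near small feature 1
    [0.0, 0.0, 1.0],    # a direction the small dictionary does not cover
]
groups = split_candidates(small, large)
print(groups)
```

In this toy setup, small feature 0 collects three large-dictionary features, which is the pattern a split would produce. The last large feature stays unassigned, which corresponds to the separate case of a larger dictionary covering a concept the smaller one lacks.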

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAE features tend to shatter manifolds into many small and apparently unrelated pieces, obscuring the overarching semantic structure.
claim · 0.848
Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.

Feature splitting means dictionaries with fewer features provide coarser summaries of model features while larger dictionaries reveal finer-grained distinctions, with no uniquely 'correct' number of features.
claim · 0.831
Authors argue the absence of a fixed feature count is a property of the superposition geometry, not a failure of the method.

Feature splitting
concept · 0.817
Phenomenon where a feature in a small SAE splits into multiple finer features in a larger SAE.

Larger SAEs contain features for concepts not captured in smaller SAEs, indicating improved coverage.
claim · 0.809
Scaling SAE size increases granularity and discovers new features.

Our SAEs' features are more interpretable than neurons.
claim · 0.788
Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.

Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions.
claim · 0.768
Surprising finding that the two evaluation methods diverge in their relationship with persistence.

VPD subcomponents avoid feature splitting, improving interpretability over the SAE approach.
claim · 0.768
Core interpretative claim that VPD's parameter-based decomposition prevents the feature fragmentation seen in activation-based methods.

SAE features generalize to images despite training only on text, indicating out-of-distribution robustness.
claim · 0.768
A promising property for interpretability analysis off-distribution.