finding

active

finding:a-san-francisco-feature-in-1m-sae-splits-into-11-fine-grained-features-in-34m-sae

A 'San Francisco' feature in the 1M SAE splits into 11 fine-grained features in the 34M SAE.

Empirical observation of feature splitting.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Claims (1)

claim

Feature splitting occurs: smaller SAE features split into multiple finer-grained features in larger SAEs.
supports
Observed across SAE scales, e.g., 'San Francisco' split into 11 features.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

34M SAE had roughly 65% dead features. · finding · 0.775
Most features dead in largest SAE, indicating room for improvement.
SAE sparse features (100K+ features, 64 active per token) · concept · 0.763
The specific SAE architecture trained: 100K+ possible features compressed to 64 active per token for layer-40 activations
SAE Feature #77278 fires 195,040 times in corpus, associated with satisfaction vs. emptiness dimension · finding · 0.759
High-frequency SAE feature reported as controlling fundamental positive vs. negative affect dimension
SAE Feature #92372 fires 666,235 times in corpus, associated with urgency vs. receptive calm dimension · finding · 0.754
Example of a highly active SAE feature modulating urgency versus acceptance as an emotional dimension
For all three SAEs (1M, 4M, 34M), average active features per token <300, and reconstruction variance explained ≥65%. · finding · 0.749
Basic SAE performance metrics.
SAE features tend to shatter manifolds into many small and apparently-unrelated pieces, obscuring the overarching semantic structure. · claim · 0.748
Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
SAE feature #92372 (fires 666,235 times in corpus) modulates a dimension related to urgency/pressure vs. patience/spaciousness in Kimi K2.5. · finding · 0.748
Highly active SAE feature with broad emotional modulation and large corpus presence
SAE Feature #69088 has 100th percentile emotion subspace fraction and produces spooky-themed writing under steering · finding · 0.739
Shows that highest emotion-subspace-overlap features induce distinctive thematic outputs
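
The "cosine ≥ 0.65 · no typed edge" filter behind the list above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the function names, the embedding dictionary, and the representation of typed edges as unordered ID pairs are all assumptions introduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to target_id but not linked to it.

    embeddings:  dict mapping entity ID -> embedding vector (hypothetical store)
    typed_edges: set of frozenset({id_a, id_b}) for pairs that already share
                 a typed relation (hypothetical representation)
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already connected by a typed edge
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would then be reviewed by hand as candidates for new edges or as possible duplicates.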