Feature splitting

Phenomenon in which a single feature learned by a small sparse autoencoder (SAE) splits into multiple finer-grained features in a larger SAE.
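One rough way to look for candidate splits is to compare feature directions across two dictionary sizes: each feature in the larger dictionary is assigned to its most similar feature in the smaller one, and a small-dictionary feature that collects several such "children" is a split candidate. The sketch below assumes this nearest-parent heuristic; it is not necessarily how the cited work identifies splits. The feature names, the toy 3-dimensional vectors, the `split_candidates` helper, and the 0.5 threshold are all hypothetical and chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_candidates(small_dirs, large_dirs, threshold=0.5):
    """Assign each large-dictionary feature to its nearest small-dictionary feature.

    Returns {small_feature: [(large_feature, similarity), ...]}. A small feature
    with several children is a candidate for having split. The threshold is an
    illustrative assumption, not a value taken from the source.
    """
    children = {name: [] for name in small_dirs}
    for large_name, large_vec in large_dirs.items():
        best_name, best_sim = max(
            ((s_name, cosine(s_vec, large_vec)) for s_name, s_vec in small_dirs.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            children[best_name].append((large_name, round(best_sim, 3)))
    return children


# Toy directions, invented for this example (not real SAE decoder vectors).
small_dirs = {"base64": [1.0, 0.0, 0.0], "other": [0.0, 0.0, 1.0]}
large_dirs = {
    "base64_letters": [0.9, 0.3, 0.0],
    "base64_digits": [0.9, -0.3, 0.0],
    "base64_ascii": [0.8, 0.0, 0.2],
    "unrelated": [0.0, 0.1, 1.0],
}

result = split_candidates(small_dirs, large_dirs)
for parent, kids in result.items():
    print(parent, "->", [name for name, _ in kids])
```

In this toy setup the single `base64` direction picks up three children while `other` picks up one, which mirrors the shape of a one-to-three split. Real dictionaries would need the actual decoder weights and some care in choosing the threshold.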

Neighborhood — ranked by edge-count

Papers (1)

paper

Interpreting Language Model Parameters
about

Claims (1)

claim

VPD subcomponents are sparse, interpretable, and avoid feature splitting.
cites
Assertion about the qualitative advantages of VPD's rank-one decomposition.

Methods (1)

method

UMAP Embedding of Features
supports
2D embedding of feature direction vectors used to visualize feature clusters and splitting geometry

Concepts (2)

concept

monosemanticity
associated_with
Interpretability property where a latent feature represents a single semantic concept; benchmarked across architectures.
Single-Token Features
associated_with
Features that fire on every instance of a single token; appear in small dictionaries as collapsed versions of many token-in-context features

Findings (1)

finding

Single base64 feature A/0/45 splits into three distinct features in A/1: letter-specific, digit-specific, and ASCII-encoding-specific
supports
Concrete example of feature splitting revealing unexpected model structure

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Feature splitting means dictionaries with fewer features provide coarser summaries of model features while larger dictionaries reveal finer-grained distinctions, with no uniquely 'correct' number of features · claim · 0.818
Authors argue the absence of a fixed feature count is a property of the superposition geometry, not a failure of the method
Feature splitting occurs: smaller SAE features split into multiple finer-grained features in larger SAEs · claim · 0.817
Observed across SAE scales, e.g., 'San Francisco' split into 11 features.
Feature Sparsity · concept · 0.782
Property that features activate on only a small fraction of inputs; enables compressed sensing and is what allows superposition to work
Lot splitting · concept · 0.777
The subdivision of properties to create smaller, individually owned lots that support unique buildings and increased density.
Feature Universality · concept · 0.762
Property of features that form consistently across different models trained on the same or similar data, suggesting features are real representational units
Feature engineering · concept · 0.756
Domain of techniques for constructing informative features from raw data; covariance pooling is a feature engineering method for token sequences.
feature as application · concept · 0.755
Metaphor treating each system feature or function as a separate application that can be independently loaded and managed.
Feature Visualization · method · 0.753
Method of optimizing an input to make a neuron fire maximally, used to characterize what the neuron detects; establishes a causal link between input patterns and the neuron's activation.
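
The selection rule behind this list (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter over precomputed similarity scores. This is a minimal illustration, not the page's actual implementation: the `similarity_candidates` helper is hypothetical, three of the scores are copied from the list above, and the scores for "monosemanticity" and the "hypothetical distant entity" are invented to show the two exclusion cases.

```python
def similarity_candidates(scores, typed_neighbors, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have no typed
    edge to the current entity, sorted by score, highest first."""
    kept = [
        (name, score)
        for name, score in scores.items()
        if score >= threshold and name not in typed_neighbors
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


scores = {
    "Feature Universality": 0.762,         # from the list above
    "Feature Sparsity": 0.782,             # from the list above
    "Lot splitting": 0.777,                # from the list above
    "monosemanticity": 0.800,              # invented score; already has a typed edge
    "hypothetical distant entity": 0.400,  # invented; falls below the threshold
}
typed_neighbors = {"monosemanticity", "Single-Token Features"}

candidates = similarity_candidates(scores, typed_neighbors)
for name, score in candidates:
    print(f"{name} · {score:.3f}")
```

Because the filter uses embedding similarity alone, off-topic entries such as "Lot splitting" can pass it, which is why the results are treated as candidates for review rather than confirmed relations.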