SAE sparse features (100K+ features, 64 active per token)

The specific SAE architecture trained on layer-40 activations: a dictionary of 100K+ features, of which 64 are active per token.
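A fixed budget of 64 active features per token is what a top-k activation rule would produce, although the description does not state which sparsity mechanism was used. The sketch below is a minimal, stdlib-only illustration under that assumption. The function names (`topk_sparse_code`, `encode`, `decode`), the weight layout, and the toy sizes are hypothetical and stand in for the real 100K+ feature dictionary and layer-40 activation width.

```python
import heapq
import random


def topk_sparse_code(pre_acts, k):
    """Keep the k largest pre-activations (clamped at zero); drop the rest.

    Returns a sparse code as {feature_index: activation}.
    """
    top = heapq.nlargest(k, range(len(pre_acts)), key=pre_acts.__getitem__)
    return {i: max(pre_acts[i], 0.0) for i in top}


def encode(x, W_enc, b_enc, k):
    """Linear encoder followed by top-k sparsification (assumed mechanism)."""
    pre_acts = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(W_enc, b_enc)
    ]
    return topk_sparse_code(pre_acts, k)


def decode(code, W_dec, b_dec):
    """Reconstruct the activation as a sparse sum of decoder rows."""
    out = list(b_dec)
    for i, a in code.items():
        for j, w in enumerate(W_dec[i]):
            out[j] += a * w
    return out


# Toy sizes only: the entity describes 100K+ features and k = 64.
random.seed(0)
d_model, n_features, k = 16, 1000, 64
W_enc = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
W_dec = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model

x = [random.gauss(0, 1) for _ in range(d_model)]  # stand-in for one token's activation
code = encode(x, W_enc, b_enc, k)
recon = decode(code, W_dec, b_dec)
```

Storing the code as an index-to-value mapping makes the sparsity explicit: each token carries exactly `k` entries regardless of how large the dictionary is. The weights here are random, so the reconstruction is not meaningful; a trained SAE would learn them by minimizing reconstruction error.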

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
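The selection rule stated above (cosine similarity at or above 0.65, and no existing typed relation) can be sketched as a simple filter. This is an illustration only: the function `similar_untyped`, the embedding dictionary, and the representation of typed edges as unordered pairs are assumptions, not the actual pipeline behind this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """Entities near `target` in embedding space that lack a typed edge to it.

    `embeddings` maps entity name -> vector; `typed_edges` is a set of
    frozenset({a, b}) pairs (hypothetical representation).
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target or frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Tiny made-up example: "d" is close to "a" but already has a typed edge.
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.05]}
typed_edges = {frozenset(("a", "d"))}
related = similar_untyped("a", embeddings, typed_edges)
```

In this toy case only `"b"` is returned: `"c"` falls below the threshold and `"d"` is excluded because it is already linked by a typed edge.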

For all three SAEs (1M, 4M, 34M), average active features per token <300, and reconstruction variance explained ≥65%. (finding · 0.822)
Basic SAE performance metrics.
Sparse Autoencoders (SAE) (method · 0.782)
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
34M SAE had roughly 65% dead features. (finding · 0.775)
Most features dead in largest SAE, indicating room for improvement.
SAE features (concept · 0.769)
The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
SAE Feature #92372 fires 666,235 times in corpus, associated with urgency vs. receptive calm dimension (finding · 0.768)
Example of a highly active SAE feature modulating urgency versus acceptance as an emotional dimension.
SAE Feature #77278 fires 195,040 times in corpus, associated with satisfaction vs. emptiness dimension (finding · 0.768)
High-frequency SAE feature reported as controlling a fundamental positive vs. negative affect dimension.
TopK Sparse Autoencoders (SAEs) (method · 0.767)
Sparse dictionary learning method used to extract interpretable features from EEG transformer embeddings.
SAEs successfully extract sparse feature dictionaries from embeddings of SleepFM, REVE, and LaBraM EEG transformers. (finding · 0.765)
Foundational empirical result enabling all downstream analysis.
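Several of the related findings above quote the same three SAE health metrics: average active features per token, reconstruction variance explained, and the fraction of dead features. The sketch below shows one common way to compute them. It assumes variance explained is defined as one minus the ratio of residual to total sum of squares, and that a feature counts as dead if it never fires on the evaluation sample; the cited works may use different definitions. The function `sae_metrics` and the sparse-code format `{feature_index: activation}` are hypothetical.

```python
def sae_metrics(inputs, recons, codes, n_features):
    """Compute basic SAE diagnostics on an evaluation sample.

    inputs, recons: lists of equal-length float vectors (one per token).
    codes: list of sparse codes, each a dict {feature_index: activation}.
    Assumes the inputs are not all identical (total variance must be nonzero).
    """
    n, d = len(inputs), len(inputs[0])
    mean = [sum(x[j] for x in inputs) / n for j in range(d)]

    # Residual and total sums of squares over all tokens and dimensions.
    resid = sum((xi - ri) ** 2 for x, r in zip(inputs, recons) for xi, ri in zip(x, r))
    total = sum((xi - mj) ** 2 for x in inputs for xi, mj in zip(x, mean))

    # Per-token sets of features with nonzero activation.
    active = [{i for i, a in code.items() if a != 0} for code in codes]
    fired = set().union(*active)

    return {
        "variance_explained": 1.0 - resid / total,
        "avg_active": sum(len(s) for s in active) / n,
        "dead_fraction": 1.0 - len(fired) / n_features,
    }


# Made-up toy data: 4 tokens, 2 dimensions, a 4-feature dictionary.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
recons = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0], [0.0, 0.0]]
codes = [{0: 1.0}, {1: 0.5}, {0: 0.2, 1: 0.3}, {}]
metrics = sae_metrics(inputs, recons, codes, n_features=4)
```

Note that the dead-feature fraction depends on the size of the evaluation sample: a rarely firing feature can look dead on a small sample, so reported figures are only comparable when the sample sizes are similar.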