Sparse Feature Dictionary

The set of sparse, interpretable features extracted from model embeddings with sparse autoencoders (SAEs)

Neighborhood — ranked by edge-count

framework

TopK Sparse Autoencoders
introduces
The central mechanistic interpretability tool applied across all three EEG transformers to extract sparse feature dictionaries
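
As a rough illustration of what a TopK sparse autoencoder does, the sketch below keeps only the k largest encoder pre-activations per input and zeroes the rest, then reconstructs the input from the surviving features. The function names, the list-of-lists weight layout, and the absence of a ReLU before the top-k selection are assumptions made for this example; the entry above does not specify the implementation used on the EEG transformers.

```python
import heapq


def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations and set all others to zero."""
    if k <= 0:
        return [0.0] * len(pre_acts)
    # Indices of the k largest values.
    keep = set(heapq.nlargest(k, range(len(pre_acts)), key=lambda i: pre_acts[i]))
    return [v if i in keep else 0.0 for i, v in enumerate(pre_acts)]


def encode(x, W_enc, b_enc, k):
    """Linear encoder followed by TopK; W_enc has one row per feature (assumed layout)."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_enc, b_enc)]
    return topk_activation(pre, k)


def decode(z, W_dec, b_dec):
    """Reconstruct the input as a sparse sum of dictionary rows W_dec[j] (assumed layout)."""
    out = list(b_dec)
    for j, zj in enumerate(z):
        if zj != 0.0:  # only active features contribute
            for d in range(len(out)):
                out[d] += zj * W_dec[j][d]
    return out
```

Because exactly k features are active for every input, sparsity is set directly by k rather than tuned through a penalty weight.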

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
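
The selection rule described here (cosine similarity of at least 0.65 and no typed edge to this entity) can be sketched as follows. The data layout is hypothetical: `embeddings` maps entity names to vectors and `typed_edges` maps an entity to the set of entities it already has a typed relation with. Neither structure is taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities similar to `target` that have no typed edge to it, best match first."""
    linked = typed_edges.get(target, set())
    out = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already related
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            out.append((name, score))
    return sorted(out, key=lambda pair: pair[1], reverse=True)
```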

Sparse Dictionary Learning · method · 0.855
General method for finding overcomplete sparse decompositions; the paper uses sparse autoencoders as an approximation
Feature Sparsity · concept · 0.814
Property that features activate on only a small fraction of inputs; enables compressed sensing and is what allows superposition to work
Superposition of Sparse Features · concept · 0.811
Mechanistic finding by Bricken et al. 2023 about how LLMs store features; cited as operational justification for pattern-repository assumption
Sparse Autoencoder for Dictionary Learning · framework · 0.804
Primary method introduced: trains a one-hidden-layer MLP with an L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
Sparse Autoencoder Features · concept · 0.799
Used in Anthropic's welfare assessment to identify co-activating features for performative behavior and hidden emotional struggle
Feature Density · concept · 0.797
Fraction of training tokens on which a given feature has nonzero activation; used as proxy metric for autoencoder quality
Sparse and smooth coding · concept · 0.785
Coding scheme where qualities are represented by few neurons with continuous similarity relations
Sparse Autoencoder · framework · 0.773
Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
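
The Feature Density entry above defines the metric as the fraction of tokens on which a feature is nonzero. A minimal sketch of that computation, assuming the activations arrive as one list of feature values per token (a layout chosen only for this example):

```python
def feature_density(activations):
    """Per-feature fraction of tokens with a nonzero activation.

    `activations` is a list of rows, one per token, each holding one value per feature.
    """
    if not activations:
        return []
    n_tokens = len(activations)
    counts = [0] * len(activations[0])
    for row in activations:
        for j, a in enumerate(row):
            if a != 0:
                counts[j] += 1  # feature j fired on this token
    return [c / n_tokens for c in counts]
```

A feature with density near 1 fires on almost every token, and one with density 0 never fires on the sampled tokens.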