Sparse Autoencoder Features

Used in Anthropic's welfare assessment to identify features that co-activate with performative behavior and hidden emotional struggle.

Neighborhood — ranked by edge-count

concept

finding

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sparse Autoencoder · framework · 0.919
Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
Sparse autoencoders produce interpretable features for large models. · claim · 0.859
Central claim of the paper: the method scales to state-of-the-art transformers.
Training the sparse autoencoder on more data makes features subjectively sharper and more interpretable · claim · 0.856
Empirical principle discovered during autoencoder training; it led to the use of 8 billion training points.
TopK Sparse Autoencoders · framework · 0.852
The central mechanistic interpretability tool applied across all three EEG transformers to extract sparse feature dictionaries
Sparse Autoencoders (SAE) · method · 0.847
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
Sparse Autoencoder for Dictionary Learning · framework · 0.843
Primary method introduced: trains a one-hidden-layer MLP with an L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries.
Sparse Autoencoder-based Framework for Steering Semantic Features · framework · 0.837
The main framework proposed for retrieving and steering high-order semantic features in LLMs via sparse autoencoders.
Autoencoder · concept · 0.832
Neural network architecture that learns compressed representations; SOHMs are functionally equivalent.
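
The "Sparse Autoencoder for Dictionary Learning" entry above describes a one-hidden-layer MLP trained with an L1 sparsity penalty to produce an overcomplete feature dictionary. As a rough illustration only, the forward pass and loss for such a model might look like the sketch below. All names here (`encode`, `decode`, `sae_loss`, `init_params`, `l1_coeff`), the ReLU nonlinearity, the dimensions, and the initialization are assumptions made for the example, not details taken from any of the listed sources, and no training loop is shown.

```python
import random


def matvec(W, v):
    # W is a list of rows; returns the matrix-vector product W @ v.
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def encode(x, W_enc, b_enc):
    # Feature activations: ReLU(W_enc x + b_enc), one value per dictionary entry.
    pre = matvec(W_enc, x)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]


def decode(f, W_dec, b_dec):
    # Reconstruction of the input activation: W_dec f + b_dec.
    out = matvec(W_dec, f)
    return [o + b for o, b in zip(out, b_dec)]


def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    # Mean squared reconstruction error plus an L1 penalty on the features.
    f = encode(x, W_enc, b_enc)
    x_hat = decode(f, W_dec, b_dec)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(a) for a in f)
    return mse + l1_coeff * l1, f


def init_params(d_in, d_hidden, seed=0):
    # Hypothetical small random initialization; real setups differ.
    rng = random.Random(seed)
    W_enc = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_hidden)]
    W_dec = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_in)]
    return W_enc, [0.0] * d_hidden, W_dec, [0.0] * d_in


# Overcomplete dictionary: more features (16) than input dimensions (4).
d_in, d_hidden = 4, 16
params = init_params(d_in, d_hidden)
x = [0.5, -1.0, 0.25, 2.0]  # stand-in for one model activation vector
loss, features = sae_loss(x, *params)
```

The L1 term pushes most entries of `features` toward zero during training, which is what makes the learned dictionary sparse. A TopK variant, as in the "TopK Sparse Autoencoders" entry, would instead keep only the k largest activations.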