Sparse Autoencoder

Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence

Neighborhood — ranked by edge-count

method

Sparse Autoencoder Training on Layer-40 Activations
implements
SAEs trained on 100M+ tokens to compress per-token layer-40 activations into 64 active features out of 100K+ for interpretability analysis
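
A fixed count of active features (64 here) is often obtained with a top-k activation, though the entry above does not say which mechanism was used. The sketch below is a minimal illustration under that assumption: `topk_sparsify` and the toy input are hypothetical, and a dictionary of 6 with k=2 stands in for 100K+ with k=64.

```python
import heapq

def topk_sparsify(pre_activations, k):
    """Keep the k largest pre-activations (passed through ReLU); zero the rest."""
    # Indices of the k largest values in the feature vector.
    keep = set(heapq.nlargest(k, range(len(pre_activations)),
                              key=lambda i: pre_activations[i]))
    return [max(v, 0.0) if i in keep else 0.0
            for i, v in enumerate(pre_activations)]

# Toy stand-in: 6 dictionary features, 2 allowed to be active.
codes = topk_sparsify([0.1, 2.0, -1.0, 0.7, 1.5, 0.0], k=2)
active = [i for i, v in enumerate(codes) if v != 0.0]
print(codes, active)
```

Only the indices in `active` carry information for a given token, which is what makes the per-token code inspectable.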

framework

Sparse Autoencoder for Dictionary Learning
related_to
Primary method introduced: trains a one-hidden-layer MLP with L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
TopK Sparse Autoencoders
related_to
The central mechanistic interpretability tool applied across all three EEG transformers to extract sparse feature dictionaries
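
The L1-penalized variant described above (a one-hidden-layer network whose hidden layer is wider than its input) can be sketched as an encoder, a decoder, and a loss that adds an L1 term to the reconstruction error. This is an illustrative sketch only: `sae_loss`, the weight names, and the toy values are assumptions, and no training loop is shown.

```python
def relu(xs):
    return [max(x, 0.0) for x in xs]

def matvec(matrix, vec):
    # Multiply a row-major matrix (list of rows) by a vector.
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Return (loss, features, reconstruction) for one activation vector x."""
    # Encoder: project into the overcomplete dictionary, then ReLU.
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    f = relu(pre)
    # Decoder: rebuild the activation from the feature vector.
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    # Squared reconstruction error plus L1 penalty on the features.
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in f)
    return recon + l1_coeff * sparsity, f, x_hat

# Toy setup: 2-dimensional activation, 3 dictionary features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]
W_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
b_dec = [0.0, 0.0]

loss, f, x_hat = sae_loss([1.0, 0.0], W_enc, b_enc, W_dec, b_dec, l1_coeff=0.1)
print(loss, f, x_hat)
```

In this toy case the input is reconstructed exactly from a single active feature, so the loss consists of the L1 term alone. A larger `l1_coeff` trades reconstruction accuracy for fewer active features.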

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sparse Autoencoder Features · concept · 0.919
Used in Anthropic welfare assessment to identify co-activating features for performative behavior and hidden emotional struggle
Sparse Autoencoders (SAE) · method · 0.880
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
Autoencoder · concept · 0.877
Neural network architecture that learns compressed representations; SOHMs are functionally equivalent.
Deep Autoencoder · framework · 0.845
Sparse autoencoders produce interpretable features for large models. · claim · 0.841
Central claim of the paper: the method scales to state-of-the-art transformers.
Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters · claim · 0.839
Critique of activation-based interpretability methods.
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.831
Core methodology paper for SAE-based interpretable feature extraction
Sparse Autoencoder-based Framework for Steering Semantic Features · framework · 0.830
The main framework proposed for retrieving and steering high-order semantic features in LLMs via sparse autoencoders.