Sparse Autoencoder for Dictionary Learning

Primary method introduced: trains a one-hidden-layer MLP with an L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
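
As a rough illustration of the method summarized above, the following pure-Python sketch computes the training loss for a single activation vector: an encoder into a wider hidden layer, a linear decoder, and mean squared reconstruction error plus an L1 penalty on the hidden activations. The names `matvec`, `sae_loss`, `W_enc`, `W_dec`, and `l1_coeff`, the ReLU nonlinearity, and the toy dimensions are assumptions made for illustration and are not taken from this entry.

```python
def matvec(W, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Return (loss, features, reconstruction) for one activation vector x."""
    # Encoder: one hidden layer with ReLU (assumed); more features than inputs.
    pre = matvec(W_enc, x)
    f = [max(0.0, p + b) for p, b in zip(pre, b_enc)]
    # Decoder: linear map from feature activations back to activation space.
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    mse = sum((a - r) ** 2 for a, r in zip(x, x_hat)) / len(x)
    l1 = sum(abs(a) for a in f)  # sparsity penalty on the feature activations
    return mse + l1_coeff * l1, f, x_hat

# Toy example: 2-dimensional activations, 4 dictionary features (overcomplete).
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
W_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
loss, f, x_hat = sae_loss([1.0, -1.0], W_enc, [0.0] * 4, W_dec, [0.0] * 2, l1_coeff=0.5)
print(loss, f, x_hat)
```

In this toy case only two of the four features fire and the input is reconstructed exactly, so the whole loss comes from the L1 term. In training, `l1_coeff` would trade reconstruction quality against sparsity.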

Neighborhood — ranked by edge-count

Thinkers (3)

thinker

Trenton Bricken
studies
Toy models of superposition.
Lee Sharkey
studies
Co-author of VPD paper; mechanistic interpretability researcher affiliated with Goodfire.
Hoagy Cunningham
studies
Co-author; de-risked residual-stream SAEs, ran feature completeness analysis.

Methods (1)

method

Sparse Dictionary Learning
implements
General method for finding overcomplete sparse decompositions; the paper uses sparse autoencoders as an approximation

Concepts (2)

concept

Pre-Encoder Bias
associated_with
Architectural modification that subtracts a learned bias from the autoencoder's inputs before encoding; the bias is initialized to the geometric median of the dataset and improves autoencoder performance
Untied Decoder Weights
associated_with
Autoencoder design choice to learn separate encoder and decoder weights, increasing representational capacity by allowing encoder vectors to distinguish similar features
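
The two design choices above can be sketched together in a few lines. This is a minimal sketch under stated assumptions: the geometric median is approximated with Weiszfeld's iteration (this entry does not say how it is computed), the same bias is added back on the decoder side, and the helper names `matvec`, `geometric_median`, `encode`, `decode`, and `b_pre` are hypothetical.

```python
import math

def matvec(W, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def geometric_median(points, iters=100, eps=1e-9):
    """Approximate the geometric median with Weiszfeld's iteration (assumed method)."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    m = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(iters):
        # Inverse-distance weights; eps guards against division by zero.
        weights = [1.0 / max(math.dist(p, m), eps) for p in points]
        total = sum(weights)
        m = [sum(w * p[i] for w, p in zip(weights, points)) / total for i in range(dim)]
    return m

def encode(x, W_enc, b_enc, b_pre):
    # Pre-encoder bias: subtract b_pre from the input before encoding.
    centered = [a - b for a, b in zip(x, b_pre)]
    return [max(0.0, p + b) for p, b in zip(matvec(W_enc, centered), b_enc)]

def decode(f, W_dec, b_pre):
    # Assumption: the same bias is added back after decoding.
    return [r + b for r, b in zip(matvec(W_dec, f), b_pre)]

data = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
b_pre = geometric_median(data)  # initial value only; it would be learned afterwards
# Untied weights: W_dec is its own matrix, not the transpose of W_enc.
W_enc = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]]
W_dec = [[0.5, 0.0, -0.5, 0.0], [0.0, 0.5, 0.0, -0.5]]
f = encode([2.0, 0.0], W_enc, [0.0] * 4, b_pre)
x_hat = decode(f, W_dec, b_pre)
print(b_pre, f, x_hat)
```

Because the encoder and decoder matrices are stored separately, the encoder rows here can be scaled differently from the decoder columns while the reconstruction still recovers the input.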

Claims (1)

claim

Sparse autoencoders are preferable to stronger iterative dictionary learning methods because they cannot recover features the model itself cannot access
supports
Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms

Frameworks (3)

framework

Sparse Autoencoder
related_to
Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
Superposition Hypothesis
implements
Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition
Compressed Sensing
extends
Mathematical framework enabling recovery of high-dimensional sparse vectors from low-dimensional projections; theoretically underpins sparse autoencoder approach

Artifacts (1)

artifact

Feature Visualization Interface (transformer-circuits.pub/2023/monosemantic-features/vis/)
about
Interactive interface for exploring the features of all 90 learned dictionaries, including activating examples, logit effects, and ablations

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sparse Autoencoder Features · concept · 0.843
Used in Anthropic welfare assessment to identify performative behavior and hidden emotional struggle co-activating features
Sparse Autoencoders (SAE) · method · 0.826
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
Training the sparse autoencoder on more data makes features subjectively sharper and more interpretable · claim · 0.824
Empirical principle discovered during autoencoder training; led to using 8 billion training points
TopK Sparse Autoencoders · framework · 0.811
The central mechanistic interpretability tool applied across all three EEG transformers to extract sparse feature dictionaries
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.805
Core methodology paper for SAE-based interpretable feature extraction
Sparse Feature Dictionary · concept · 0.804
The extracted set of sparse interpretable features from model embeddings via SAEs
Sparse Autoencoder Training on Layer-40 Activations · method · 0.804
SAEs trained on 100M+ tokens to compress token layer-40 activations into 64 active features out of 100K+ for interpretability analysis
Sparse autoencoders produce interpretable features for large models. · claim · 0.803
Central claim of the paper: the method scales to state-of-the-art transformers.
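
Two of the neighboring entries above mention a fixed number of active features per token (TopK sparse autoencoders, and 64 active features out of 100K+). Those entries do not describe the mechanism, so the following is only a sketch of how a top-k sparsity constraint is commonly implemented: keep the k largest encoder pre-activations and zero the rest. The function name `topk_activation`, the ReLU on the surviving values, and the example numbers are assumptions.

```python
def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations and zero out all others."""
    # Indices of the k largest values; ties go to the lower index (stable sort).
    order = sorted(range(len(pre_acts)), key=lambda i: -pre_acts[i])
    keep = set(order[:k])
    # ReLU on the survivors is an assumption, so kept features are non-negative.
    return [max(0.0, v) if i in keep else 0.0 for i, v in enumerate(pre_acts)]

acts = topk_activation([0.3, -1.2, 2.5, 0.9, 0.1], k=2)
print(acts)  # [0.0, 0.0, 2.5, 0.9, 0.0]
```

Unlike an L1 penalty, this fixes the number of active features directly instead of tuning a penalty coefficient.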