Sparse Autoencoder Training on Layer-40 Activations

SAEs trained on 100M+ tokens to compress per-token layer-40 activations into 64 active features (out of a dictionary of 100K+) for interpretability analysis
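A minimal sketch of the sparsity constraint summarized above, assuming a TopK-style activation. The source gives only the budget (64 active features out of 100K+) and does not name the mechanism, so the function name `topk_sae_forward`, the weight layout, and the toy sizes (8-dimensional activations, 32 features, k = 4) are illustrative assumptions rather than the actual training setup.

```python
import random


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep only the k largest pre-activations, then decode.

    W_enc and W_dec are lists of n_features rows, each of length d_model
    (a hypothetical layout chosen for readability).
    """
    # One pre-activation per dictionary feature.
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(W_enc, b_enc)]
    # Indices of the k largest pre-activations.
    top = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    # Sparse code: ReLU of the kept values, zero everywhere else.
    z = [0.0] * len(pre)
    for i in top:
        z[i] = max(pre[i], 0.0)
    # Reconstruction from the (at most k) active features.
    x_hat = [b_dec[j] + sum(z[i] * W_dec[i][j] for i in top)
             for j in range(len(x))]
    return z, x_hat


# Toy stand-ins for the real sizes (64 active out of 100K+ features).
random.seed(0)
d_model, n_features, k = 8, 32, 4
W_enc = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
W_dec = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model
x = [random.gauss(0, 1) for _ in range(d_model)]

z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
print(sum(v != 0.0 for v in z), "active features out of", n_features)
```

At real scale the same selection would normally be done with a batched top-k routine in an array library instead of a Python sort.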

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Sparse Autoencoder
implements
Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence

Concepts (1)

concept

layer 40 residual-stream activations
about
The specific neural network layer from which activations are extracted for probe construction and SAE training in the target models

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters · claim · 0.814
Critique of activation-based interpretability methods.
Sparse Autoencoder Features · concept · 0.813
Used in Anthropic's welfare assessment to identify co-activating features associated with performative behavior and hidden emotional struggle
Training the sparse autoencoder on more data makes features subjectively sharper and more interpretable · claim · 0.807
Empirical principle discovered during autoencoder training; led to using 8 billion training points
Sparse Autoencoder for Dictionary Learning · framework · 0.804
Primary method introduced: trains a one-hidden-layer MLP with L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
Sparse Autoencoders (SAE) activation-based paradigm · framework · 0.786
Standard interpretability approach that VPD critiques and proposes an alternative to.
TopK Sparse Autoencoders · framework · 0.782
The central mechanistic interpretability tool applied across all three EEG transformers to extract sparse feature dictionaries
Sparse autoencoders are preferable to stronger iterative dictionary learning methods because they cannot recover features the model itself cannot access · claim · 0.781
Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
Sparse Autoencoders (SAE) · method · 0.777
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
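
For comparison with the fixed-budget variant, the "Sparse Autoencoder for Dictionary Learning" entry above describes a one-hidden-layer MLP with an L1 sparsity penalty. A rough sketch of such an objective follows. The ReLU nonlinearity, the summed squared-error reconstruction term, and every name here (`l1_sae_loss`, `l1_coeff`, the toy weights) are assumptions made for illustration and are not taken from the source.

```python
def l1_sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Return (total, reconstruction, sparsity) for one activation vector.

    W_enc and W_dec are lists of n_features rows, each of length d_model
    (a hypothetical layout chosen for readability).
    """
    # Hidden layer: ReLU(W_enc x + b_enc), one unit per dictionary feature.
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W_enc, b_enc)]
    # Linear decoder back to activation space.
    x_hat = [b_dec[j] + sum(h[i] * W_dec[i][j] for i in range(len(h)))
             for j in range(len(x))]
    # Summed squared reconstruction error (an assumed choice of loss).
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    # h is non-negative after ReLU, so its sum equals its L1 norm.
    sparsity = sum(h)
    return recon + l1_coeff * sparsity, recon, sparsity


# Toy overcomplete dictionary: 3 features for a 2-dimensional activation.
x = [1.0, 2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

total, recon, sparsity = l1_sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=0.1)
print(total, recon, sparsity)
```

Unlike a hard top-k cutoff, the penalty coefficient only encourages sparsity, so the number of active features varies from input to input.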