concept

active

concept:sparse-autoencoders-find-highly-interpretable-features-in-language-models-cunningham-et-al-2023

Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023)

Core methodology paper for SAE-based interpretable feature extraction
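
For orientation, the kind of model this entry refers to can be sketched in a few lines. This is a generic sparse autoencoder forward pass and loss, not the paper's exact configuration: the tied encoder/decoder weights, the toy dictionary, the bias values, and the names `encode`, `decode`, `sae_loss`, and `l1_coeff` are all assumptions made for illustration.

```python
def relu(value):
    return max(0.0, value)

def encode(x, dictionary, bias):
    # Feature activations: c_i = ReLU(d_i . x + b_i), one per dictionary row.
    return [
        relu(sum(d_j * x_j for d_j, x_j in zip(row, x)) + b)
        for row, b in zip(dictionary, bias)
    ]

def decode(codes, dictionary):
    # Reconstruction: x_hat = sum_i c_i * d_i (tied weights, an assumption here).
    dim = len(dictionary[0])
    return [sum(c * row[j] for c, row in zip(codes, dictionary)) for j in range(dim)]

def sae_loss(x, dictionary, bias, l1_coeff):
    # Squared reconstruction error plus an L1 penalty that encourages sparse codes.
    codes = encode(x, dictionary, bias)
    x_hat = decode(codes, dictionary)
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = l1_coeff * sum(abs(c) for c in codes)
    return reconstruction + sparsity, codes

# Toy overcomplete dictionary: 3 unit-norm feature directions for a 2-d activation.
dictionary = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
bias = [0.0, 0.0, -1.5]  # hypothetical bias that suppresses the third feature here
activation = [2.0, 0.0]

loss, codes = sae_loss(activation, dictionary, bias, l1_coeff=0.1)
```

In this toy case only the first feature fires, so the activation is reconstructed from a single dictionary direction. In practice the dictionary and bias would be learned by gradient descent on this loss over many activations.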

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
cites

Concepts (1)

concept

Sparse Autoencoder Features
related_to
Used in Anthropic's welfare assessment to identify co-activating features for performative behavior and hidden emotional struggle

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
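
The selection rule described above (cosine similarity at or above 0.65, no typed edge to this entity) can be sketched as follows. This is a minimal illustration only: the embedding vectors, the entity identifiers, and the function name `similar_without_edge` are hypothetical, and the actual system may compute or store similarities differently.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    # Return (entity_id, score) pairs that clear the threshold and have no
    # typed edge to the target, sorted by descending similarity.
    target = embeddings[target_id]
    hits = []
    for other_id, vector in embeddings.items():
        if other_id == target_id:
            continue
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed relation such as "cites"
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical toy embeddings; real ones would come from an embedding model.
embeddings = {
    "paper": [1.0, 0.0],
    "claim_a": [0.9, 0.1],
    "claim_b": [0.0, 1.0],
    "concept": [1.0, 0.05],
}
typed_edges = {frozenset(("paper", "concept"))}

candidates = similar_without_edge("paper", embeddings, typed_edges)
```

Here `concept` is excluded despite its high similarity because it already has a typed edge, and `claim_b` falls below the threshold, so only `claim_a` is returned as a candidate.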

Sparse autoencoders produce interpretable features for large models. · claim · 0.932
Central claim of the paper: the method scales to state-of-the-art transformers.
We hypothesize that sparse autoencoders or similar methods will work on frontier large language models, though significant computational challenges remain · hypothesis · 0.871
Forward-looking prediction about scalability of the method to larger models
Training the sparse autoencoder on more data makes features subjectively sharper and more interpretable · claim · 0.870
Empirical principle discovered during autoencoder training; led to using 8 billion training points
Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters · claim · 0.869
Critique of activation-based interpretability methods.
Sparse autoencoders are preferable to stronger iterative dictionary learning methods because they cannot recover features the model itself cannot access · claim · 0.852
Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence · claim · 0.845
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Sparse Autoencoder · framework · 0.831
Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
Natural Language Autoencoders achieve readable explanations through unsupervised reconstruction loss optimized with reinforcement learning, not explicit interpretability constraints. · claim · 0.831
Core insight: reconstruction objective combined with appropriate initialization and KL regularization produces human-interpretable explanations as emergent property.