claim

active

claim:sparse-autoencoders-are-preferable-to-stronger-iterative-dictionary-learning-methods-because-they-cannot-recover-features-the-model-itself-cannot-access

Sparse autoencoders are preferable to stronger iterative dictionary learning methods because they cannot recover features the model itself cannot access

Rationale for using simpler sparse autoencoders rather than stronger iterative algorithms for the NP-hard compressed sensing problem

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Sparse Autoencoder for Dictionary Learning
supports
Primary method introduced: trains a one-hidden-layer MLP with an L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
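
The framework description above can be illustrated with a minimal NumPy sketch of the forward pass and training objective. It is not the source paper's implementation. The parameter names (`W_enc`, `W_dec`, `b_enc`, `b_dec`), the ReLU activation, the subtraction of the decoder bias before encoding, the unit-norm initialization, and the `l1_coeff` value are all assumptions made for illustration. Only the one-hidden-layer shape, the L1 penalty, and the overcomplete dictionary (`n_features` larger than `d_model`) come from the description.

```python
import numpy as np


def init_sae(d_model, n_features, seed=0):
    """Create parameters for an overcomplete autoencoder (n_features > d_model)."""
    rng = np.random.default_rng(seed)
    # Each row of W_dec is one dictionary direction in activation space.
    W_dec = rng.normal(size=(n_features, d_model))
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm rows (assumed)
    return {
        "W_enc": W_dec.T.copy(),        # (d_model, n_features)
        "b_enc": np.zeros(n_features),
        "W_dec": W_dec,                 # (n_features, d_model)
        "b_dec": np.zeros(d_model),
    }


def sae_forward(params, x):
    """Single-pass encode and decode; x has shape (batch, d_model)."""
    # Hidden layer: non-negative feature activations (ReLU is an assumption).
    pre = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    f = np.maximum(0.0, pre)
    # Reconstruction: a linear combination of dictionary rows.
    x_hat = f @ params["W_dec"] + params["b_dec"]
    return f, x_hat


def sae_loss(params, x, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    f, x_hat = sae_forward(params, x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity


# Hypothetical usage on random stand-in "activations".
params = init_sae(d_model=16, n_features=64)
x = np.random.default_rng(1).normal(size=(8, 16))
f, x_hat = sae_forward(params, x)
loss = sae_loss(params, x)
```

The encoder here is a single matrix multiply followed by a nonlinearity, so it computes feature activations in one pass. Iterative dictionary learning methods instead refine the sparse code over several steps.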

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters · claim · 0.866
Critique of activation-based interpretability methods.
Sparse autoencoders produce interpretable features for large models. · claim · 0.863
Central claim of the paper: the method scales to state-of-the-art transformers.
We hypothesize that sparse autoencoders or similar methods will work on frontier large language models, though significant computational challenges remain · hypothesis · 0.858
Forward-looking prediction about scalability of the method to larger models
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.852
Core methodology paper for SAE-based interpretable feature extraction
Training the sparse autoencoder on more data makes features subjectively sharper and more interpretable · claim · 0.848
Empirical principle discovered during autoencoder training; led to using 8 billion training points
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence · claim · 0.818
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Sparse Autoencoder · framework · 0.817
Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.808
Motivation for using sparsity-based dictionary learning on language models