Sparse Dictionary Learning

General method for finding overcomplete sparse decompositions; the paper uses sparse autoencoders as an approximation

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Sparse Autoencoder for Dictionary Learning
implements
Primary method introduced: trains a one-hidden-layer MLP with an L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
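
As a rough illustration of the description above, the following is a minimal sketch of a one-hidden-layer sparse autoencoder with an L1 penalty on its hidden activations. The entry states only the architecture and the penalty; the ReLU nonlinearity, the squared-error reconstruction term, the dimensions, the coefficient value, and all names (`sae_forward`, `sae_loss`, `l1_coeff`, `W_enc`, `W_dec`) are assumptions made for illustration, and no training loop is shown.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One hidden layer: ReLU encoder (assumed), linear decoder."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # feature activations, (batch, n_features)
    x_hat = f @ W_dec + b_dec               # reconstruction, (batch, d_model)
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff):
    """Squared reconstruction error (assumed) plus L1 sparsity penalty on features."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity

# Hypothetical sizes: the dictionary is overcomplete when n_features > d_model.
rng = np.random.default_rng(0)
d_model, n_features, batch = 16, 64, 8

x = rng.normal(size=(batch, d_model))  # stand-in for model activations
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))  # rows act as dictionary elements
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=1e-3)
```

In this sketch each row of `W_dec` plays the role of one dictionary element, and `f` holds the sparse coefficients that combine those elements to reconstruct an activation vector.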

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sparse Feature Dictionary · concept · 0.855
The set of sparse, interpretable features extracted from model embeddings via SAEs
The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.799
Motivation for using sparsity-based dictionary learning on language models
Dictionary Learning for Neural Network Interpretability · method · 0.769
Bricken et al.'s method for decomposing language models into interpretable features; cited as AI alignment interpretability relevant to consciousness detection
Sparse autoencoders are preferable to stronger iterative dictionary learning methods because they cannot recover features the model itself cannot access · claim · 0.767
Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
Sparse Autoencoder · framework · 0.765
Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
Learning · concept · 0.755
Inference of parameters encoding contingencies of the world (e.g., likelihood matrix A) at a slower timescale than perception.
Sparse and smooth coding · concept · 0.754
Coding scheme where qualities are represented by few neurons with continuous similarity relations.
Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning · claim · 0.748
Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations