Dictionary Learning for Neural Network Interpretability

Bricken et al.'s method for decomposing language model activations into interpretable features; cited as AI alignment interpretability work relevant to consciousness detection

Neighborhood — ranked by edge-count

Concepts (1)

concept

Representational Embedding Spaces
supports
Internal structure of AI systems that CIMC proposes to analyze interpretively to evaluate consciousness hypotheses

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
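The filter described above (cosine similarity at or above 0.65, no typed edge) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `cosine` and `similar_untyped`, the dictionary-of-vectors layout, and the toy embeddings are hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_untyped(target, embeddings, typed_neighbors, threshold=0.65):
    """Return (name, score) pairs for entities that are close to `target`
    in embedding space but have no typed edge to it, highest score first.

    embeddings: dict mapping entity name -> vector (hypothetical layout)
    typed_neighbors: set of names already linked to `target` by a typed edge
    """
    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            candidates.append((name, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration only.
toy = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.2]}
neighbors = similar_untyped("a", toy, typed_neighbors={"d"})
```

Here `d` is excluded because it already has a typed edge, and `c` falls below the threshold, so only `b` is reported as a candidate.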

Neural Network Interpretability · concept · 0.864
The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper
Sparse interpretability in neural networks · concept · 0.797
VPD achieves sparse, interpretable parameter subcomponents with improved sparsity-reconstruction tradeoff.
Neural Decoding Program · framework · 0.776
Neural Decoding · framework · 0.772
Sparse Dictionary Learning · method · 0.769
General method for finding overcomplete sparse decompositions; the paper uses sparse autoencoders as an approximation
Sparse Autoencoder for Dictionary Learning · framework · 0.761
Primary method introduced: trains a one-hidden-layer MLP with L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
Neural Networks · concept · 0.759
Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning · claim · 0.758
Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
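The sparse autoencoder entry above describes a one-hidden-layer MLP trained with an L1 sparsity penalty to decompose activations into an overcomplete dictionary. A minimal forward-pass sketch of that idea follows, in pure Python. It is not the authors' implementation: the class name, the initialization, the ReLU nonlinearity, the mean-squared reconstruction term, and the `l1_coeff` parameter are assumptions made for illustration, and no training loop is shown.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class SparseAutoencoder:
    """One hidden layer; d_dict > d_model makes the dictionary overcomplete."""

    def __init__(self, d_model, d_dict, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(d_model)  # assumed init, for illustration
        self.W_enc = [[rng.gauss(0.0, scale) for _ in range(d_model)]
                      for _ in range(d_dict)]
        self.b_enc = [0.0] * d_dict
        self.W_dec = [[rng.gauss(0.0, scale) for _ in range(d_dict)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Feature activations: ReLU keeps them non-negative.
        pre = matvec(self.W_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        # Reconstruction is a weighted sum of dictionary directions plus bias.
        out = matvec(self.W_dec, features)
        return [o + b for o, b in zip(out, self.b_dec)]

    def loss(self, x, l1_coeff):
        """Reconstruction error plus L1 penalty on the feature activations."""
        features = self.encode(x)
        x_hat = self.decode(features)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        l1 = sum(abs(f) for f in features)
        return mse + l1_coeff * l1, features


# Toy activation vector: 4 model dimensions, 16 dictionary features.
sae = SparseAutoencoder(d_model=4, d_dict=16)
activation = [0.5, -1.0, 0.25, 2.0]
total_loss, feats = sae.loss(activation, l1_coeff=0.01)
```

Raising `l1_coeff` pushes an optimizer toward fewer active features per input at the cost of reconstruction accuracy, which is the sparsity-reconstruction tradeoff mentioned in the entries above.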