method
active
method:dictionary-learning-for-neural-network-interpretability
Dictionary Learning for Neural Network Interpretability
Bricken et al.'s method for decomposing language model activations into interpretable features; cited as AI alignment interpretability work relevant to consciousness detection
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Internal structure of AI systems that CIMC proposes to analyze interpretively to evaluate consciousness hypotheses
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper
- VPD achieves sparse, interpretable parameter subcomponents with improved sparsity-reconstruction tradeoff.
- General method for finding overcomplete sparse decompositions; the paper uses sparse autoencoders as an approximation
- Primary method introduced: trains a one-hidden-layer MLP with L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
- Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
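
The sparse autoencoder described in the list above (a one-hidden-layer MLP with an L1 sparsity penalty that maps activations into an overcomplete feature dictionary) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`init_params`, `forward`, `loss`, `train_step`), the initialization scale, the plain gradient-descent update, and the values of `l1_coeff` and `lr` are all hypothetical choices made for the example, and the input `x` is random data standing in for real model activations.

```python
import numpy as np

def init_params(d_model, d_dict, seed=0):
    """Random initial weights; d_dict > d_model gives an overcomplete dictionary."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0.0, 0.1, (d_model, d_dict)),
        "b_enc": np.zeros(d_dict),
        "W_dec": rng.normal(0.0, 0.1, (d_dict, d_model)),
        "b_dec": np.zeros(d_model),
    }

def forward(params, x):
    """Encode activations into non-negative features, then reconstruct them."""
    pre = x @ params["W_enc"] + params["b_enc"]
    f = np.maximum(0.0, pre)                        # ReLU feature activations
    x_hat = f @ params["W_dec"] + params["b_dec"]   # linear decoder (the dictionary)
    return f, x_hat

def loss(params, x, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    f, x_hat = forward(params, x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity

def train_step(params, x, l1_coeff=1e-3, lr=0.02):
    """One step of plain gradient descent (an assumption made for brevity)."""
    n = x.shape[0]
    f, x_hat = forward(params, x)
    d_x_hat = 2.0 * (x_hat - x) / n                 # gradient of the reconstruction term
    # f is non-negative, so the L1 subgradient is 1 wherever a feature is active.
    d_f = d_x_hat @ params["W_dec"].T + (l1_coeff / n) * (f > 0)
    d_pre = d_f * (f > 0)                           # backprop through the ReLU
    grads = {
        "W_enc": x.T @ d_pre,
        "b_enc": d_pre.sum(axis=0),
        "W_dec": f.T @ d_x_hat,
        "b_dec": d_x_hat.sum(axis=0),
    }
    return {k: params[k] - lr * grads[k] for k in params}

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(64, 8))                    # stand-in for model activations
    params = init_params(d_model=8, d_dict=32)
    before = loss(params, x)
    for _ in range(200):
        params = train_step(params, x)
    print("loss decreased:", loss(params, x) < before)
```

Each row of `W_dec` plays the role of one dictionary direction, and the corresponding column of `f` says how strongly that feature is active on each input. Raising `l1_coeff` trades reconstruction quality for sparser feature activations.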