concept
active
concept:sparse-interpretability-in-neural-networks
Sparse interpretability in neural networks
VPD achieves sparse, interpretable parameter subcomponents with an improved sparsity-reconstruction tradeoff.
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Neural Network Interpretability · related_to · The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Bricken et al.'s method for decomposing language models into interpretable features; cited as AI alignment interpretability relevant to consciousness detection
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.774 · Core methodology paper for SAE-based interpretable feature extraction
- Central claim of the paper: the method scales to state-of-the-art transformers.
- Can an interpretable symbolic algorithm be used to faithfully explain a complex neural network model? · question · 0.765 · Framing question for the paper's research program.
- Mechanism by which superposition works: small neural networks exploit sparsity to approximately simulate much larger sparse networks
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
- Cited as enabling precise behavioral control through SAE features, extending the same methodological line
- Coding scheme where qualities are represented by few neurons with continuous similarity relations.
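
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched roughly as below. This is only an illustration under assumptions, not the tool's actual implementation: the function names, the toy two-dimensional embeddings, the entity keys, and the representation of typed edges as a set of pairs are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs with score >= threshold and no typed edge to target."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities that already have a typed relation in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    # Highest similarity first.
    return sorted(results, key=lambda item: item[1], reverse=True)

# Toy data (hypothetical): keys and vectors are made up for illustration.
embeddings = {
    "sparse-interpretability": [1.0, 0.0],
    "neural-network-interpretability": [0.9, 0.1],
    "sae-features": [0.8, 0.6],
    "unrelated": [0.0, 1.0],
}
typed_edges = {("sparse-interpretability", "neural-network-interpretability")}

candidates = similar_untyped("sparse-interpretability", embeddings, typed_edges)
print(candidates)
```

In this toy setup the entity that already has a `related_to` edge is excluded even though it is highly similar, which mirrors the "no typed edge" filter described for the list.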