concept
active
concept:neural-network-interpretability
Neural Network Interpretability
The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper
Neighborhood — ranked by edge-count
Papers (1)
Thinkers (1)
- Chris Olah · studies · Co-author; provided high-level research guidance and wrote the introduction and discussion.
Frameworks (1)
- Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies
Concepts (4)
- VPD achieves sparse, interpretable parameter subcomponents with improved sparsity-reconstruction tradeoff.
- Causal abstraction · associated_with · A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs
- Knowledge Localization · associated_with · Technique for identifying where specific knowledge is stored in neural network layers via interventions
- Zooming In (scientific methodology) · associated_with · The metaphor for a qualitative shift in scientific inquiry to finer-grained detail, analogous to the microscope's role in cellular biology
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Bricken et al.'s method for decomposing language models into interpretable features; cited as AI alignment interpretability relevant to consciousness detection
- The capability to explain model predictions; a central theme of the paper, with disruption profiles as vehicle.
- Advantage of DiffLogic CA over NCA — learned rules are pure binary logic circuits that can be visualized and analyzed
- Can an interpretable symbolic algorithm be used to faithfully explain a complex neural network model? · question · 0.797 · Framing question for the paper's research program.
- Method using large language models (Claude) to generate and test explanations of features at scale
- Cognition in nervous systems, used as a modelling target
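
The "cosine ≥ 0.65 · no typed edge" filter behind this list could work roughly as sketched below. This is a minimal illustration, not the graph's actual implementation: the function names, the toy embedding vectors, and every entity ID other than concept:neural-network-interpretability are hypothetical, and it assumes each entity has an embedding vector and that typed edges are stored as ID pairs.

```python
import math

# Cutoff taken from the page ("cosine ≥ 0.65"); everything else is assumed.
SIMILARITY_THRESHOLD = 0.65


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges,
                          threshold=SIMILARITY_THRESHOLD):
    """Return (entity_id, score) pairs that are similar to the target
    but share no typed edge with it, highest score first."""
    target_vec = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked by a typed edge in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((other_id, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Toy data: hypothetical IDs and 3-dimensional vectors for illustration only.
TARGET = "concept:neural-network-interpretability"
embeddings = {
    TARGET: [1.0, 0.0, 0.0],
    "concept:example-similar": [0.9, 0.1, 0.0],   # similar, no typed edge
    "concept:example-linked": [0.8, 0.6, 0.0],    # similar, but already linked
    "concept:example-unrelated": [0.0, 1.0, 0.0], # orthogonal to the target
}
typed_edges = {(TARGET, "concept:example-linked")}

candidates = related_by_similarity(TARGET, embeddings, typed_edges)
for entity_id, score in candidates:
    print(f"{entity_id}\t{score:.3f}")
```

Only the similar, unlinked entity survives both filters. Each surviving pair is a candidate either for a new typed edge or for a duplicate check, which matches how the page describes this list.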