Neural Network Interpretability

The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper

Neighborhood — ranked by edge-count

paper

thinker

Chris Olah
studies
Co-author; provided high-level research guidance, wrote introduction/discussion.

framework

Interpretability as Natural Science
about
Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies

concept

Sparse interpretability in neural networks
related_to
VPD achieves sparse, interpretable parameter subcomponents with improved sparsity-reconstruction tradeoff.
Causal abstraction
associated_with
A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs
Knowledge Localization
associated_with
Technique for identifying where specific knowledge is stored in neural network layers via interventions
Zooming In (scientific methodology)
associated_with
The metaphor for a qualitative shift in scientific inquiry to finer-grained detail, analogous to the microscope's role in cellular biology

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Dictionary Learning for Neural Network Interpretability · method · 0.864
Bricken et al.'s method for decomposing language models into interpretable features; cited as AI alignment interpretability relevant to consciousness detection
interpretability · concept · 0.818
The capability to explain model predictions; a central theme of the paper, with disruption profiles as vehicle.
Interpretability tools can reveal what 'feeling alive' looks like inside a neural network model. · claim · 0.802
Circuit Interpretability · concept · 0.802
Advantage of DiffLogic CA over NCA — learned rules are pure binary logic circuits that can be visualized and analyzed
Can an interpretable symbolic algorithm be used to faithfully explain a complex neural network model? · question · 0.797
Framing question for the paper's research program.
Automated Interpretability · framework · 0.796
Method using large language models (Claude) to generate and test explanations of features at scale
Neural Networks · concept · 0.791
neural cognition · concept · 0.781
Cognition in nervous systems, used as a modelling target
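
The candidate list above is described as entities with cosine ≥ 0.65 and no typed edge to this one. A minimal sketch of such a filter follows, assuming each entity has an embedding vector. The function names, the dictionary layout, and the toy vectors are hypothetical illustrations, not the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold, highest score first.

    embeddings:  hypothetical mapping of entity name -> embedding vector
    typed_edges: names that already have a typed relation to `target`
    """
    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            candidates.append((name, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy vectors, made up for illustration only.
toy_embeddings = {
    "Neural Network Interpretability": [1.0, 0.0],
    "Circuit Interpretability": [1.0, 0.1],
    "Causal abstraction": [1.0, 0.0],   # similar, but already has a typed edge
    "Unrelated entity": [0.0, 1.0],
}
toy_result = untyped_neighbors(
    "Neural Network Interpretability",
    toy_embeddings,
    typed_edges={"Causal abstraction"},
)
```

In this toy setup only "Circuit Interpretability" is returned: "Causal abstraction" is excluded because it already has a typed edge, and the unrelated entity falls below the threshold.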