Interpretability as Natural Science

Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies

Neighborhood — ranked by edge-count

paper

concept

Neural Network Interpretability
about
The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper

claim

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

interpretability · concept · 0.881
The capability to explain model predictions; a central theme of the paper, with disruption profiles as the vehicle.
Automated Interpretability · framework · 0.836
Method using large language models (Claude) to generate and test explanations of features at scale
Interpretability Illusion · concept · 0.806
Cases where subspace interventions change model behaviour through parallel pathways rather than the target feature
Interpretability today is a pre-paradigmatic field lacking consensus on objects of study, methods, and evaluative standards. · claim · 0.795
Diagnosis of the state of the interpretability field, drawing on Kuhn's framework
Circuit Interpretability · concept · 0.789
Advantage of DiffLogic CA over NCA — learned rules are pure binary logic circuits that can be visualized and analyzed
"For interpretability, I don't think we even have the right definitions." · quote · 0.785
Ian Goodfellow quote used to illustrate the pre-paradigmatic state of interpretability research
interpretative method · method · 0.778
The historical/hermeneutic approach adopted by the paper to analyze cybernetic diagrams in light of Flusser’s philosophy.
Bottom-up interpretability · concept · 0.774
An interpretability paradigm that explains computation in the model's own terms, rather than imposing top-down abstractions; VPD aims to realize this.
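
The neighborhood rule stated above (cosine ≥ 0.65, no typed edge) can be illustrated with a small sketch. This is not the page's actual implementation: the function names, the toy embeddings, and the edge representation as unordered name pairs are all assumptions made for illustration. Only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are semantically close to
    `target` but share no typed edge with it, highest score first.

    `embeddings` maps entity name -> vector; `typed_edges` is an iterable of
    (name_a, name_b) pairs. Both structures are hypothetical.
    """
    # Collect every entity already linked to the target, in either direction.
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    scored = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything with a typed edge
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (invented): one linked neighbor, one unlinked neighbor, one distant entity.
toy_embeddings = {
    "target": [1.0, 0.0],
    "near_typed": [1.0, 0.1],
    "near_untyped": [1.0, 0.5],
    "far": [0.0, 1.0],
}
toy_edges = [("target", "near_typed")]
candidates = untyped_neighbors("target", toy_embeddings, toy_edges)
```

Here only `near_untyped` survives the filter: `near_typed` is excluded because it already has a typed edge, and `far` falls below the threshold. Entities surfaced this way are the candidates for new edges or duplicate merging that the page describes.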