interpretability

The capability to explain model predictions; a central theme of the paper, which uses disruption profiles as its vehicle.

Neighborhood — ranked by edge-count

paper

concept

Interpretability Illusion
related_to
Cases where subspace interventions change model behaviour through parallel pathways rather than the target feature

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Automated Interpretability · framework · 0.886
Method using large language models (Claude) to generate and test explanations of features at scale
Interpretability as Natural Science · framework · 0.881
Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies
Circuit Interpretability · concept · 0.845
Advantage of DiffLogic CA over NCA: learned rules are pure binary logic circuits that can be visualized and analyzed
Interpretive Validation · concept · 0.842
CIMC's methodology for evaluating whether a built system is conscious: combining multiple forms of evidence including predicted functional organization and developmental trajectories
"For interpretability, I don't think we even have the right definitions." · quote · 0.830
Ian Goodfellow quote used to illustrate the pre-paradigmatic state of interpretability research
interpretative method · method · 0.827
The historical/hermeneutic approach adopted by the paper to analyze cybernetic diagrams in light of Flusser’s philosophy.
interpretive abstraction (method) · method · 0.819
Programming technique to restructure a fine-grained Linda program for efficiency by replacing live data structures with passive ones and coarser-grain processes.
Neural Network Interpretability · concept · 0.818
The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper
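
The "cosine ≥ 0.65 · no typed edge" filter behind this list can be sketched roughly as follows. This is a minimal illustration only: the function names, the toy two-dimensional embeddings, and the edge representation are assumptions made for the example, not the system's actual implementation or data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities similar to `target` that have
    no typed edge to it, sorted by descending similarity.

    `embeddings` maps entity name -> vector; `typed_edges` is an iterable of
    (entity_a, entity_b) pairs. Both shapes are hypothetical.
    """
    # Entities already connected to the target by some typed relation.
    linked = {b if a == target else a for a, b in typed_edges if target in (a, b)}
    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            candidates.append((name, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data (made up): one entity with a typed edge, one similar entity
# without an edge, and one dissimilar entity.
embeddings = {
    "interpretability": [1.0, 0.0],
    "Interpretability Illusion": [0.9, 0.1],
    "Automated Interpretability": [0.8, 0.6],
    "unrelated entity": [0.0, 1.0],
}
typed_edges = [("interpretability", "Interpretability Illusion")]

result = untyped_neighbors("interpretability", embeddings, typed_edges)
```

With this toy data, only "Automated Interpretability" should survive the filter: "Interpretability Illusion" is excluded because it already has a typed edge, and "unrelated entity" falls below the threshold.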