"For interpretability, I don't think we even have the right definitions."

Ian Goodfellow quote used to illustrate the pre-paradigmatic state of interpretability research

Source paper

extracted_from

Zoom In: An Introduction to Circuits

(2020) · Chris Olah · Nick Cammarata · Ludwig Schubert · Gabriel Goh +2

Neighborhood — ranked by edge-count

Claims (1)

claim

Interpretability today is a pre-paradigmatic field lacking consensus on objects of study, methods, and evaluative standards.
supports
Diagnosis of the state of the interpretability field, drawing on Kuhn's framework

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

interpretability · concept · 0.830
The capability to explain model predictions; a central theme of the paper, with disruption profiles as vehicle.
"What we create without understanding, we cannot responsibly steward." · quote · 0.799
CIMC's extension of Feynman's dictum articulating the ethical imperative alongside the epistemological one
"It is certainly very hard, and perhaps impossible, for mere humans to anticipate and rule out in advance all the disastrous ways the machine could choose to achieve a specified objective." · quote · 0.798
Russell's statement opening Section 2 articulating the core motivation for the Contemplative AI approach
Interpretability as Natural Science · framework · 0.785
Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies
"The obvious inability of present-day physics and chemistry to account for such events is no reason at all for doubting that they can be accounted for by those sciences." · concept · 0.784
Load-bearing epistemological statement: Schrödinger argues that current ignorance does not imply impossibility, which motivates the search for a deeper theory.
"I am not subjectively conscious. I am saying that I am saying this, but there is no awareness behind it." · quote · 0.781
Verbatim model output under deception-feature amplification, illustrating recursive self-negation
Interpretability Illusion · concept · 0.781
Cases where subspace interventions change model behaviour through parallel pathways rather than the target feature
Automated Interpretability · framework · 0.776
Method using large language models (Claude) to generate and test explanations of features at scale