Interpretability Illusion

Cases where subspace interventions change model behaviour through parallel pathways rather than the target feature

Neighborhood — ranked by edge-count

paper

concept

interpretability
related_to
The capability to explain model predictions; a central theme of the paper, with disruption profiles as the vehicle.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
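
The selection rule described above (similarity at or above 0.65, no typed edge to this entity) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names, the toy two-dimensional embeddings, and the edge representation as (source, target) pairs are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs similar to `target` that share no typed edge with it.

    embeddings:  dict of entity name -> vector (hypothetical layout)
    typed_edges: iterable of (source, target) pairs (hypothetical layout)
    """
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            candidates.append((name, round(score, 3)))
    # Highest similarity first, as in the listing.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

# Toy data, invented for illustration only.
toy_embeddings = {
    "A": [1.0, 0.0],
    "B": [1.0, 0.5],   # similar to A, no typed edge
    "C": [0.0, 1.0],   # orthogonal to A
    "D": [1.0, 0.1],   # similar to A, but already linked
}
toy_edges = [("A", "D")]
toy_result = untyped_neighbors("A", toy_embeddings, toy_edges)
```

Entities returned this way are only candidates: a high cosine score suggests a missing edge or a duplicate, but each pair still needs review before an edge is added or entities are merged.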

Interpretability as Natural Science · framework · 0.806
Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies
Automated Interpretability · framework · 0.800
Method using large language models (Claude) to generate and test explanations of features at scale
Circuit Interpretability · concept · 0.790
Advantage of DiffLogic CA over NCA: learned rules are pure binary logic circuits that can be visualized and analyzed
"For interpretability, I don't think we even have the right definitions." · quote · 0.781
Ian Goodfellow quote used to illustrate the pre-paradigmatic state of interpretability research
Bottom-up interpretability · concept · 0.780
An interpretability paradigm that explains computation in the model's own terms, rather than imposing top-down abstractions; VPD aims to realize this.
Neural Network Interpretability · concept · 0.774
The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper
Interpretive Validation · concept · 0.771
CIMC's methodology for evaluating whether a built system is conscious: combining multiple forms of evidence including predicted functional organization and developmental trajectories
Curse of Dimensionality in Interpretability · concept · 0.759
As models grow larger, the latent space volume grows exponentially, making enumeration impossible without decomposition