Automated Interpretability

Method using large language models (Claude) to generate and test explanations of features at scale

Neighborhood — ranked by edge-count

thinker

Steven Bills
extends
Developed automated interpretability approach using LLMs to explain neuron activations

method

Importance Scoring
associated_with
Weighted Spearman correlation that corrects for sampling bias in automated interpretability evaluation
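
The entry above names a weighted Spearman correlation but does not define it. One common reading is a weighted Pearson correlation computed on ranks, sketched below. The function names, the use of average ranks for ties, and the idea that each weight is an inverse sampling probability are illustrative assumptions, not details from the source.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def weighted_pearson(x: Sequence[float], y: Sequence[float],
                     w: Sequence[float]) -> float:
    """Pearson correlation in which each pair (x_i, y_i) carries weight w_i."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


def weighted_spearman(predicted: Sequence[float], actual: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Rank both inputs, then take the weighted Pearson correlation of the ranks.

    Assumption: weights are supplied by the caller, for example the inverse of
    each example's sampling probability, so over-sampled examples count less.
    """
    if not (len(predicted) == len(actual) == len(weights)):
        raise ValueError("all inputs must have the same length")
    return weighted_pearson(average_ranks(predicted), average_ranks(actual), weights)
```

With equal weights this reduces to the ordinary Spearman coefficient, so any difference from the unweighted score comes only from the weights.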

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

interpretability · concept · 0.886
The capability to explain model predictions; a central theme of the paper, with disruption profiles as vehicle.
Interpretability as Natural Science · framework · 0.836
Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies
Automated interpretability pipeline using LLMs · method · 0.813
Using Claude 3 Opus to generate feature explanations and predict held-out activations.
Circuit Interpretability · concept · 0.809
Advantage of DiffLogic CA over NCA — learned rules are pure binary logic circuits that can be visualized and analyzed
Interpretability Illusion · concept · 0.800
Cases where subspace interventions change model behaviour through parallel pathways rather than the target feature
interpretive abstraction (method) · method · 0.800
Programming technique to restructure a fine-grained Linda program for efficiency by replacing live data structures with passive ones and coarser-grain processes.
Neural Network Interpretability · concept · 0.796
The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper
Bottom-up interpretability · concept · 0.790
An interpretability paradigm that explains computation in the model's own terms, rather than imposing top-down abstractions; VPD aims to realize this.