Feature Interpretability Rubric

A 14-point scoring rubric for human evaluation of feature interpretability, covering confidence, activation consistency, logit consistency, and specificity.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one. They are candidates for new edges or may be unrecognized duplicates.
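
The selection rule above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the actual pipeline behind this page: the function names, the `embeddings` mapping, and the `typed_edges` set of name pairs are all hypothetical, and only the 0.65 threshold comes from the page itself.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities near `target` that lack a typed edge.

    embeddings:  hypothetical dict mapping entity name -> embedding vector
    typed_edges: hypothetical set of (name, name) pairs that already have a typed relation
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Entities returned by such a filter would then be reviewed by hand, either to add a typed edge or to merge duplicates.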

interpretability · concept · 0.798
The capability to explain model predictions; a central theme of the paper, with disruption profiles as vehicle.
Circuit Interpretability · concept · 0.760
Advantage of DiffLogic CA over NCA: learned rules are pure binary logic circuits that can be visualized and analyzed
Automated Interpretability · framework · 0.758
Method using large language models (Claude) to generate and test explanations of features at scale
Interpretability features converge across different model architectures, revealing structural similarities. · claim · 0.745
Interpretability as Natural Science · framework · 0.744
Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies
Median feature interval scored 12/14 on interpretability rubric vs median neuron score of 0 · finding · 0.740
Human analysis showing features are substantially more interpretable than neurons
Neural Network Interpretability · concept · 0.739
The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper
Interpretability-driven steering · concept · 0.735
General approach of using interpretability feedback to steer model generation.