method
active
method:feature-interpretability-rubric
Feature Interpretability Rubric
A 14-point scoring rubric for human evaluation of feature interpretability, covering confidence, activation consistency, logit consistency, and specificity.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The capability to explain model predictions; a central theme of the paper, with disruption profiles as vehicle.
- Advantage of DiffLogic CA over NCA — learned rules are pure binary logic circuits that can be visualized and analyzed
- Method using large language models (Claude) to generate and test explanations of features at scale
- Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies
- Median feature interval scored 12/14 on interpretability rubric vs median neuron score of 0 (finding, similarity 0.740): Human analysis showing features are substantially more interpretable than neurons
- The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper
- General approach of using interpretability feedback to steer model generation.