Importance Scoring

Weighted Spearman correlation that corrects for sampling bias in automated interpretability evaluation
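
As a minimal sketch of the idea, a weighted Spearman correlation can be computed as a weighted Pearson correlation over ranks. The source does not say how the weights are defined. The sketch assumes each sample carries a non-negative weight (for example, an inverse sampling probability) and that ranks are ordinary average ranks. The names `average_ranks` and `weighted_spearman` are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def weighted_spearman(x, y, weights):
    """Weighted Pearson correlation of the ranks of x and y.

    Assumption: `weights` are non-negative per-sample weights chosen
    to offset sampling bias; the source does not specify them.
    """
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    total = sum(weights)
    mx = sum(w * a for w, a in zip(weights, rx)) / total
    my = sum(w * b for w, b in zip(weights, ry)) / total
    cov = sum(w * (a - mx) * (b - my) for w, a, b in zip(weights, rx, ry))
    vx = sum(w * (a - mx) ** 2 for w, a in zip(weights, rx))
    vy = sum(w * (b - my) ** 2 for w, b in zip(weights, ry))
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant ranks")
    return cov / sqrt(vx * vy)
```

With uniform weights this reduces to the ordinary Spearman correlation. Non-uniform weights let under-sampled items count for more in the score.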

Neighborhood — ranked by edge-count

framework

Automated Interpretability
associated_with
Method using large language models (Claude) to generate and test explanations of features at scale

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Factor Analysis on Scoring Dimensions · method · 0.782
Factor analysis on 2,224 data points showing that PC1 explains 82% of the variance, so the six dimensions are not independent.
Causal importance · concept · 0.771
A measure of whether a subcomponent is necessary to reproduce model behavior on a specific prompt, predicted by the causal importance network.
multiplicative scoring · concept · 0.762
Score = (sum of completed quartet values) × (number of quartets), making portfolio composition consequential.
Proper scoring rule · concept · 0.736
A scoring rule optimized by predicting true probabilities; log-loss is one.
multiplicative scoring rule · concept · 0.735
Score = (sum of completed quartet values) × (number of completed quartets), rewarding breadth.
safety scores · concept · 0.728
Metrics derived from benchmarks to quantify how safe a model is, e.g., refusal rate to harmful requests.
Term Importance Analysis via Ablation · method · 0.724
An algorithm that determines the marginal effect of n-th order path terms by running the model multiple times with frozen attention patterns and progressively replacing activations.
Pass Rate Scoring · method · 0.711
Primary metric for all benchmarks: the fraction of tasks that meet benchmark-specific pass criteria.
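
Two of the scoring rules listed above reduce to simple arithmetic, which a short sketch can make concrete. It follows the stated formulas only. The function names and the input shapes (a list of completed quartet values, a list of per-task pass flags) are assumptions made for illustration.

```python
def multiplicative_score(completed_quartet_values):
    """Score = (sum of completed quartet values) x (number of completed quartets)."""
    return sum(completed_quartet_values) * len(completed_quartet_values)


def pass_rate(task_passed):
    """Fraction of tasks whose benchmark-specific pass criterion was met."""
    if not task_passed:
        raise ValueError("pass rate is undefined for an empty task list")
    return sum(1 for passed in task_passed if passed) / len(task_passed)


# Three completed quartets worth 3, 5 and 2: (3 + 5 + 2) * 3 = 30.
print(multiplicative_score([3, 5, 2]))

# Three of four tasks pass: 3 / 4 = 0.75.
print(pass_rate([True, False, True, True]))
```

Because the sum is multiplied by the count, adding one more completed quartet raises the score by more than that quartet's own value, which is why the rule rewards breadth.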