method
active
method:importance-scoring
Importance Scoring
Weighted Spearman correlation that corrects for sampling bias in automated interpretability evaluation
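A minimal sketch of one common way to compute a weighted Spearman correlation: rank both variables (ties get average ranks), then take the weighted Pearson correlation of the ranks. This entry does not say how the weights are derived, so the `weights` argument here is an assumption; a plausible choice would be inverse sampling probabilities, so that over-sampled examples count for less. The function names `average_ranks` and `weighted_spearman` are illustrative, not taken from the source.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def weighted_spearman(
    x: Sequence[float], y: Sequence[float], weights: Sequence[float]
) -> float:
    """Weighted Pearson correlation computed on the ranks of x and y.

    `weights` is a hypothetical per-example weight (assumed here to be
    something like an inverse sampling probability).
    """
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    total = sum(weights)
    mean_x = sum(w * a for w, a in zip(weights, rx)) / total
    mean_y = sum(w * b for w, b in zip(weights, ry)) / total
    cov = sum(w * (a - mean_x) * (b - mean_y) for w, a, b in zip(weights, rx, ry))
    var_x = sum(w * (a - mean_x) ** 2 for w, a in zip(weights, rx))
    var_y = sum(w * (b - mean_y) ** 2 for w, b in zip(weights, ry))
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (var_x * var_y) ** 0.5
```

With uniform weights this reduces to the ordinary Spearman correlation; non-uniform weights shift the score toward agreement on the heavily weighted examples.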
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Automated Interpretability · associated_with · Method using large language models (Claude) to generate and test explanations of features at scale
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Factor analysis on 2224 data points revealing PC1 explains 82% of variance; six dimensions are not independent
- A measure of whether a subcomponent is necessary to reproduce model behavior on a specific prompt, predicted by the causal importance network.
- Score = (sum of completed quartet values) × (number of quartets), making portfolio composition consequential.
- A scoring rule optimized by predicting true probabilities; log-loss is one.
- Score = (sum of completed quartet values) × (number of completed quartets), rewarding breadth.
- Metrics derived from benchmarks to quantify how safe a model is, e.g., refusal rate to harmful requests.
- An algorithm that determines the marginal effect of n-th order path terms by running the model multiple times with frozen attention patterns and progressively replacing activations
- Primary metric for all benchmarks, measuring fraction of tasks that meet benchmark-specific pass criteria