method
active
method:importance-scoring

Importance Scoring

Weighted Spearman correlation that corrects for sampling bias in automated interpretability evaluation
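
The entry does not give the formula, so the following is only a minimal sketch of one common way to compute a weighted Spearman correlation: rank both variables with ordinary average ranks, then take a weighted Pearson correlation of the ranks. The function names and the idea that each weight is an inverse sampling probability are assumptions made for illustration, not details taken from the method itself.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def weighted_spearman(x, y, weights):
    """Weighted Pearson correlation of the average ranks of x and y.

    Assumption: weights are non-negative importance weights, for example
    1 / (probability that the sample was drawn).
    """
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    total = sum(weights)
    mx = sum(w * a for w, a in zip(weights, rx)) / total
    my = sum(w * b for w, b in zip(weights, ry)) / total
    cov = sum(w * (a - mx) * (b - my) for w, a, b in zip(weights, rx, ry))
    vx = sum(w * (a - mx) ** 2 for w, a in zip(weights, rx))
    vy = sum(w * (b - my) ** 2 for w, b in zip(weights, ry))
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(vx * vy)
```

With equal weights this reduces to the ordinary Spearman coefficient. Unequal weights let under-sampled examples count for more, which is the sense in which a weighted variant can offset sampling bias.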

Neighborhood — ranked by edge-count

Frameworks (1)

framework
  • Method using large language models (Claude) to generate and test explanations of features at scale

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Factor analysis on 2224 data points revealing PC1 explains 82% of variance; six dimensions are not independent
  • Causal importance · concept · 0.771
    A measure of whether a subcomponent is necessary to reproduce model behavior on a specific prompt, predicted by the causal importance network.
  • Score = (sum of completed quartet values) × (number of quartets), making portfolio composition consequential.
  • A scoring rule optimized by predicting true probabilities; log-loss is one.
  • Score = (sum of completed quartet values) × (number of completed quartets), rewarding breadth.
  • safety scores · concept · 0.728
    Metrics derived from benchmarks to quantify how safe a model is, e.g., refusal rate to harmful requests.
  • An algorithm that determines the marginal effect of n-th order path terms by running the model multiple times with frozen attention patterns and progressively replacing activations
  • Primary metric for all benchmarks, measuring fraction of tasks that meet benchmark-specific pass criteria