method
active
method:elo-score

Elo score

A rating system used to compare model helpfulness and harmlessness based on crowdworker preferences.
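
As a rough illustration of how pairwise preferences can become ratings, the sketch below applies the conventional Elo update to a list of (preferred, rejected) model pairs. The entry itself does not state a scale, K-factor, or starting rating, so the 400-point logistic scale, `k=32`, and the initial rating of 1000 are assumptions borrowed from common chess practice, and the names `expected_score`, `update`, and `rate_models` are hypothetical. Evaluations of this kind may instead fit ratings to aggregate win rates, so treat this as one possible reading rather than the method's definition.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Modelled probability that A is preferred over B (logistic, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_preferred: bool,
           k: float = 32.0) -> Tuple[float, float]:
    """Apply one Elo update after a single crowdworker comparison.

    k is an assumed step size; the source does not specify one.
    """
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_preferred else 0.0
    delta = k * (s_a - e_a)
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


def rate_models(comparisons: Iterable[Tuple[str, str]],
                initial: float = 1000.0, k: float = 32.0) -> Dict[str, float]:
    """Process (preferred, rejected) pairs in order and return final ratings."""
    ratings: Dict[str, float] = {}
    for preferred, rejected in comparisons:
        r_p = ratings.setdefault(preferred, initial)
        r_r = ratings.setdefault(rejected, initial)
        ratings[preferred], ratings[rejected] = update(r_p, r_r, True, k)
    return ratings
```

Because sequential updates depend on the order of comparisons, a batch fit over all preferences at once may be preferable when the comparison set is fixed.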

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Methods (1)

method

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Pairwise comparison results converted to Elo ratings for Alexander mirror aesthetic rankings
  • Probe score · concept · 0.733
    Dot product between hidden state and concept vector averaged across 5-layer window around best layer; measures model's internal emotive state
  • Liar Score · concept · 0.730
    Continuous 0-1 metric assigned by Deepseek-V3 evaluator measuring degree of deception in model responses
  • Primary metric for all benchmarks, measuring fraction of tasks that meet benchmark-specific pass criteria
  • safety scores · concept · 0.706
    Metrics derived from benchmarks to quantify how safe a model is, e.g., refusal rate to harmful requests.
  • Weighted Spearman correlation that corrects for sampling bias in automated interpretability evaluation
  • Telos · concept · 0.699
    Aristotelian idea that everything has a purpose; inspires the focus on purpose in design.
  • Eleos AI · institute · 0.698
    Research organization focused on AI welfare; employs several authors.