Elo score

A rating system used to compare model helpfulness and harmlessness based on crowdworker preferences.

Neighborhood — ranked by edge-count

framework

method

Crowdworker model comparison tests
implements
Procedure in which crowdworkers compare responses from two models and indicate which one they prefer; the resulting preferences are used to compute Elo scores.
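
One way pairwise preferences like these can be turned into Elo scores is sketched below. This is a minimal illustration and not necessarily the procedure used in the source: the function names, the starting rating of 1000, the K-factor of 32, and the sequential update order are all assumptions. Only the logistic expected-score formula with a 400-point scale is the standard Elo convention.

```python
from typing import Dict, Iterable, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo estimate of the probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(
    ratings: Dict[str, float],
    comparisons: Iterable[Tuple[str, str]],
    k: float = 32.0,         # assumed K-factor (hypothetical choice)
    initial: float = 1000.0  # assumed starting rating (hypothetical choice)
) -> Dict[str, float]:
    """Apply sequential Elo updates for (preferred, rejected) model pairs."""
    ratings = dict(ratings)  # copy so the caller's dict is not mutated
    for winner, loser in comparisons:
        r_w = ratings.setdefault(winner, initial)
        r_l = ratings.setdefault(loser, initial)
        # The winner gains what the loser gives up, scaled by how surprising the result was.
        delta = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


# Hypothetical crowdworker judgments: each tuple is (preferred model, other model).
judgments = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
scores = update_elo({}, judgments)
```

Sequential updates depend on the order of the comparisons; fitting all ratings at once by maximum likelihood is a common alternative when order should not matter.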

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Elo Rating Conversion · method · 0.839
Pairwise comparison results converted to Elo ratings for Alexander mirror aesthetic rankings
Probe score · concept · 0.733
Dot product between the hidden state and a concept vector, averaged across a 5-layer window around the best layer; measures the model's internal emotive state
Liar Score · concept · 0.730
Continuous 0-1 metric assigned by a DeepSeek-V3 evaluator, measuring the degree of deception in model responses
Pass Rate Scoring · method · 0.719
Primary metric for all benchmarks, measuring the fraction of tasks that meet benchmark-specific pass criteria
safety scores · concept · 0.706
Metrics derived from benchmarks to quantify how safe a model is, e.g., refusal rate on harmful requests.
Importance Scoring · method · 0.699
Weighted Spearman correlation that corrects for sampling bias in automated interpretability evaluation
Telos · concept · 0.699
Aristotelian idea that everything has a purpose; inspires the focus on purpose in design.
Eleos AI · institute · 0.698
Research organization focused on AI welfare; employs several of the authors.