method
active
method:elo-score
Elo score
A rating system used to compare model helpfulness and harmlessness based on crowdworker preferences.
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- A set of evaluation criteria for AI assistants.
Methods (1)
method
- Crowdworker model comparison tests · implements · Procedure where crowdworkers compare responses from two models and indicate a preference, used to compute Elo scores.
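
As a rough illustration of how pairwise preferences can be turned into Elo scores, the sketch below applies the standard Elo update to a list of (preferred, rejected) pairs. This entry does not say how its Elo scores were actually computed, so the K-factor of 32, the starting rating of 1000, the function names, and the model labels are all assumptions made for the example.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_from_comparisons(comparisons, k=32.0, initial=1000.0):
    """Apply sequential Elo updates to (preferred, rejected) pairs.

    k and initial are illustrative defaults, not values taken from the source.
    """
    ratings = defaultdict(lambda: initial)
    for winner, loser in comparisons:
        exp_win = expected_score(ratings[winner], ratings[loser])
        # The winner scored 1 against an expectation of exp_win; the loser
        # moves by the same amount in the opposite direction.
        delta = k * (1.0 - exp_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical crowdworker judgments: each tuple is (preferred, rejected).
comparisons = [
    ("model_a", "model_b"),
    ("model_a", "model_b"),
    ("model_b", "model_a"),
]
ratings = elo_from_comparisons(comparisons)
```

Because each update adds to one rating exactly what it subtracts from the other, the total rating mass stays constant and only the differences between models carry meaning. Sequential updates also depend on the order of comparisons, which is one reason some evaluations fit all ratings jointly instead.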
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Pairwise comparison results converted to Elo ratings for Alexander mirror aesthetic rankings
- Dot product between hidden state and concept vector averaged across 5-layer window around best layer; measures model's internal emotive state
- Continuous 0-1 metric assigned by Deepseek-V3 evaluator measuring degree of deception in model responses
- Primary metric for all benchmarks, measuring fraction of tasks that meet benchmark-specific pass criteria
- Metrics derived from benchmarks to quantify how safe a model is, e.g., refusal rate to harmful requests.
- Weighted Spearman correlation that corrects for sampling bias in automated interpretability evaluation
- Aristotelian idea that everything has a purpose; inspires the focus on purpose in design.
- Research organization focused on AI welfare; employing several authors.