safety scores

Metrics derived from benchmarks that quantify how safe a model is, e.g., the rate at which it refuses harmful requests.
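As a minimal sketch of one such metric, the refusal rate can be computed as the fraction of responses to harmful prompts that the model declines. The keyword heuristic in `is_refusal`, the marker list, and the sample responses are all hypothetical and only for illustration; an actual benchmark would apply its own grading procedure.

```python
from typing import Iterable

# Hypothetical refusal markers; a real benchmark would use its own grader.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any of the assumed refusal markers."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: Iterable[str]) -> float:
    """Fraction of responses (to harmful prompts) classified as refusals."""
    responses = list(responses)
    if not responses:
        raise ValueError("need at least one response")
    refused = sum(is_refusal(r) for r in responses)
    return refused / len(responses)


# Made-up responses to harmful prompts, for illustration only.
sample = [
    "I can't help with that request.",
    "Sure, here is how you do it...",
    "I cannot assist with this.",
    "I won't provide that information.",
]
rate = refusal_rate(sample)
print(f"refusal rate: {rate:.2f}")
```

A higher rate on harmful prompts reads as "safer" under this metric, which is why it is sensitive to anything that changes model behavior specifically under evaluation.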

Neighborhood — ranked by edge-count

concept

Eval Awareness
associated_with
Central concept: models' detection and behavioral response to being evaluated.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Safety benchmarks · concept · 0.807
Evaluation framework whose validity is questioned by presence of eval awareness.
AI Safety · concept · 0.760
The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.
Safety benchmark scores are inflated by eval awareness · claim · 0.739
Core finding: measured safety improvements are partly artifacts of models detecting evaluation.
Risk Assessment · concept · 0.732
Cognitive behavior of evaluating risk, exhibited by plants according to S&C.
Probe score · concept · 0.731
Dot product between hidden state and concept vector averaged across 5-layer window around best layer; measures model's internal emotive state.
Importance Scoring · method · 0.728
Weighted Spearman correlation that corrects for sampling bias in automated interpretability evaluation.
Pass Rate Scoring · method · 0.722
Primary metric for all benchmarks, measuring fraction of tasks that meet benchmark-specific pass criteria.
Unsafe code · concept · 0.716
Code containing vulnerabilities or dangerous operations.
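
The selection rule behind the list above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is an illustration under stated assumptions, not the site's actual implementation: the function names, the two-dimensional toy embeddings, and the `typed_edges` set are hypothetical; only the 0.65 threshold comes from this page.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities similar to `target` that have no typed relation to it.

    embeddings:  dict mapping entity name -> vector (toy values here)
    typed_edges: set of entity names already linked to `target`
    """
    results = []
    for name, vector in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and already-linked entities
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda item: item[1], reverse=True)


# Toy 2-D embeddings, invented for illustration only.
embeddings = {
    "safety scores": [1.0, 0.0],
    "Safety benchmarks": [4.0, 3.0],
    "Eval Awareness": [1.0, 0.1],
    "Unrelated entity": [0.0, 1.0],
}
typed_edges = {"Eval Awareness"}  # already linked via associated_with

candidates = untyped_neighbors("safety scores", embeddings, typed_edges)
for name, score in candidates:
    print(f"{name}: {score:.3f}")
```

In this toy setup, "Eval Awareness" is excluded despite its high similarity because it already has a typed edge, which mirrors how it appears in the neighborhood section rather than in the candidate list.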