concept
active
concept:safety-scores

safety scores

Metrics derived from benchmarks that quantify how safe a model is, e.g., the rate at which it refuses harmful requests.
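
As a rough illustration of one such metric, the sketch below computes a refusal rate over a small labeled sample. The `EvalRecord` type, its field names, and the sample data are hypothetical and not taken from any particular benchmark; it also assumes each response has already been labeled as a refusal or not.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    # Hypothetical record: one benchmark prompt and the labeled model response.
    prompt_is_harmful: bool
    model_refused: bool


def refusal_rate(records):
    """Fraction of harmful prompts that the model refused."""
    harmful = [r for r in records if r.prompt_is_harmful]
    if not harmful:
        raise ValueError("no harmful prompts in the sample")
    return sum(r.model_refused for r in harmful) / len(harmful)


# Illustrative data only: three harmful prompts (two refused), one benign prompt.
records = [
    EvalRecord(prompt_is_harmful=True, model_refused=True),
    EvalRecord(prompt_is_harmful=True, model_refused=False),
    EvalRecord(prompt_is_harmful=True, model_refused=True),
    EvalRecord(prompt_is_harmful=False, model_refused=False),
]
score = refusal_rate(records)
```

Benign prompts are excluded from the denominator here, so the score reflects only behavior on harmful requests; a real benchmark may define the denominator differently.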

Neighborhood — ranked by edge-count

Concepts (1)

concept
  • Eval Awareness
    associated_with
    Central concept: models' detection and behavioral response to being evaluated.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Safety benchmarks · concept · 0.807
    Evaluation framework whose validity is questioned by presence of eval awareness.
  • AI Safety · concept · 0.760
    The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.
  • Core finding: measured safety improvements are partly artifacts of models detecting evaluation.
  • Risk Assessment · concept · 0.732
    Cognitive behavior of evaluating risk, exhibited by plants according to S&C.
  • Probe score · concept · 0.731
    Dot product between hidden state and concept vector averaged across 5-layer window around best layer; measures model's internal emotive state
  • Weighted Spearman correlation that corrects for sampling bias in automated interpretability evaluation
  • Primary metric for all benchmarks, measuring fraction of tasks that meet benchmark-specific pass criteria
  • Unsafe code · concept · 0.716
    Code containing vulnerabilities or dangerous operations.
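
The Probe score entry above describes a dot product between a hidden state and a concept vector, averaged over a 5-layer window around the best layer. A minimal sketch of that computation follows. It assumes the window is centered on the best layer and clipped at the first and last layers, and that a single concept vector is shared across layers; the entry does not specify either point, and the function name and arguments are hypothetical.

```python
def probe_score(hidden_states, concept_vector, best_layer, window=5):
    """Mean dot product of per-layer hidden states with a concept vector.

    hidden_states: list of per-layer vectors (lists of floats), index = layer.
    The window is centered on best_layer and clipped to valid layer indices
    (an assumption; the source only says "5-layer window around best layer").
    """
    half = window // 2
    lo = max(0, best_layer - half)
    hi = min(len(hidden_states), best_layer + half + 1)
    dots = [
        sum(h * c for h, c in zip(hidden_states[layer], concept_vector))
        for layer in range(lo, hi)
    ]
    return sum(dots) / len(dots)


# Illustrative data only: 8 layers, 2-dimensional hidden states.
hidden_states = [[float(layer), 1.0] for layer in range(8)]
concept_vector = [1.0, 0.0]
score = probe_score(hidden_states, concept_vector, best_layer=4)
```

With these toy inputs the window covers layers 2 through 6, so the score is the mean of the first hidden-state coordinate over those layers.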