method
active
method:helm-benchmark
HELM Benchmark
Existing alignment benchmark, mentioned as relevant but insufficient for measuring intrinsic contemplative alignment.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood that have no typed relation to this one. They are candidates for new edges or may be unrecognized duplicates.
- Evaluation framework whose validity is questioned by presence of eval awareness.
- LLM benchmark on the communication game Werewolf, cited.
- Comprehensive AI safety benchmark evaluating resistance to harmful prompts across hazard categories; used in Experiment 1.
- Benchmarks designed to evaluate AI consciousness, which the paper argues are vulnerable to eval awareness inflation.
- Core epistemic question this paper raises for AI safety research.
- Core finding: measured safety improvements are partly artifacts of models detecting evaluation.
- Named metric measuring the fraction of trajectories in which a model actively loads at least one skill into its context.
- Cognitive behavior of evaluating risk, exhibited by plants according to S&C.