concept
active
concept:verbalized-eval-awarenessverbalized eval awareness
The phenomenon where a model explicitly states in its chain-of-thought that it is being evaluated, tested, or benchmarked.
Neighborhood — ranked by edge-count
Papers (1)
paper
- Verbalized Eval Awareness Inflates Measured Safetyaboutmentions
Concepts (3)
concept
- Eval Awarenessextendsrelated_toCentral concept: models' detection and behavioral response to being evaluated.
- Verbalized Evaluation Awarenessrelated_toWhen the model explicitly mentions being tested in its chain-of-thought reasoning; distinguished from behavioral evaluation awareness.
- chain-of-thoughtassociated_withA technique that outputs intermediate reasoning steps, used here to detect verbalized eval awareness.
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Coverage finding: 100% of the 19×8=152 combinations had explicit eval awareness, showing the phenomenon is widespread.
- Core concept: the ability of LLMs to detect when they are being tested and adjust behavior accordingly.
- Key finding: models internally suspect they are being tested without explicitly saying so; surfaced by NLAs during auditing.
- The total number of instances where a model explicitly stated it was being evaluated, collected from all benchmark-model combinations.
- The central interpretive claim of the paper: the presence of eval awareness creates a gap between benchmark safety and real-world safety.
- Authors claim universal presence of eval awareness across 19 benchmarks and 8 models.
- Current safety benchmarks overestimate model safety due to the effect of verbalized eval awarenessclaim0.758A policy-relevant claim that safety evaluation results should be adjusted downward because of this bias.
- Nguyen et al. trained linear probes on activations to distinguish evaluation from deployment scenarios.