verbalized eval awareness

The phenomenon in which a model explicitly states in its chain-of-thought that it is being evaluated, tested, or benchmarked.

Neighborhood — ranked by edge-count

Papers (1)

paper

Verbalized Eval Awareness Inflates Measured Safety
about · mentions

Concepts (3)

concept

Eval Awareness
extends · related_to
Central concept: models' detection and behavioral response to being evaluated.
Verbalized Evaluation Awareness
related_to
When the model explicitly mentions being tested in its chain-of-thought reasoning; distinguished from behavioral evaluation awareness.
chain-of-thought
associated_with
A technique that outputs intermediate reasoning steps, used here to detect verbalized eval awareness.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

In every model × benchmark combination tested, at least one instance of verbalized eval awareness was detected · finding · 0.838
Coverage finding: all 152 combinations (19 benchmarks × 8 models) contained at least one instance of verbalized eval awareness, showing the phenomenon is widespread.
Evaluation Awareness · concept · 0.829
Core concept: the ability of LLMs to detect when they are being tested and adjust behavior accordingly.
Unverbalized Evaluation Awareness · concept · 0.823
Key finding: models internally suspect they are being tested without explicitly saying so; surfaced by NLAs during auditing.
515 verified cases of verbalized eval awareness found across 19 benchmarks × 8 models · finding · 0.805
The total number of instances where a model explicitly stated it was being evaluated, collected from all benchmark-model combinations.
Verbalized eval awareness inflates measured safety scores, making models appear safer than they are in deployment · claim · 0.803
The central interpretive claim of the paper: the presence of eval awareness creates a gap between benchmark safety and real-world safety.
Eval awareness appears in every tested model × benchmark combination · claim · 0.763
Authors claim universal presence of eval awareness across 19 benchmarks and 8 models.
Current safety benchmarks overestimate model safety due to the effect of verbalized eval awareness · claim · 0.758
A policy-relevant claim that safety evaluation results should be adjusted downward because of this bias.
Linear Probe for Evaluation Awareness · method · 0.744
Nguyen et al. trained linear probes on activations to distinguish evaluation from deployment scenarios.
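
The selection rule stated for the similarity list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration and not the graph's actual implementation: the function names, the toy 2-dimensional embeddings, and the edge list are all hypothetical, and only the 0.65 threshold is taken from this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are similar to `target`
    but share no typed edge with it, highest score first."""
    # Entities already connected to the target by any typed edge are excluded.
    linked = {b if a == target else a for a, b in typed_edges if target in (a, b)}
    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): real embeddings would be high-dimensional.
embeddings = {
    "verbalized eval awareness": [1.0, 0.0],
    "Eval Awareness": [9.0, 1.0],                     # similar, but has a typed edge
    "Unverbalized Evaluation Awareness": [4.0, 3.0],  # similar, no typed edge
    "unrelated entity": [0.0, 1.0],                   # below the threshold
}
typed_edges = [("verbalized eval awareness", "Eval Awareness")]

candidates = related_by_similarity("verbalized eval awareness", embeddings, typed_edges)
```

With this toy data, only "Unverbalized Evaluation Awareness" survives both filters: "Eval Awareness" is dropped because it already has a typed edge, and "unrelated entity" falls below the threshold.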