Verbalized Evaluation Awareness

Cases in which the model explicitly mentions being tested in its chain-of-thought reasoning, as distinct from behavioral evaluation awareness.

Neighborhood — ranked by edge-count

paper

concept

Evaluation Awareness
associated_with · related_to
Core concept: the ability of LLMs to detect when they are being tested and adjust behavior accordingly.
verbalized eval awareness
related_to
The phenomenon where a model explicitly states in its chain-of-thought that it is being evaluated, tested, or benchmarked.
Unverbalized Evaluation Awareness
related_to
Key finding: models internally suspect they are being tested without explicitly saying so; surfaced by NLAs during auditing.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
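The selection rule described above (cosine at or above 0.65, no typed edge) can be sketched as a simple filter. This is a minimal illustration, not the page's actual implementation: the function name `untyped_neighbors`, the toy 3-dimensional embeddings, and the "Unrelated entity" entry are all assumptions made for the example.

```python
import numpy as np

def untyped_neighbors(target, embeddings, typed, threshold=0.65):
    """Return (name, cosine) pairs at or above `threshold` that have no typed edge
    to `target`, sorted from most to least similar. Hypothetical helper."""
    t = embeddings[target]
    t = t / np.linalg.norm(t)
    out = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already linked by a typed edge.
        if name == target or name in typed:
            continue
        score = float(np.dot(t, vec / np.linalg.norm(vec)))
        if score >= threshold:
            out.append((name, score))
    return sorted(out, key=lambda pair: pair[1], reverse=True)

# Toy embeddings with made-up values, for illustration only.
embeddings = {
    "Verbalized Evaluation Awareness": np.array([1.0, 0.0, 0.0]),
    "Evaluation Awareness": np.array([0.9, 0.1, 0.0]),   # already has a typed edge
    "Situational Awareness": np.array([0.8, 0.6, 0.0]),  # cosine 0.8 with the target
    "Unrelated entity": np.array([0.0, 0.0, 1.0]),       # cosine 0.0 with the target
}
typed = {"Evaluation Awareness"}

candidates = untyped_neighbors("Verbalized Evaluation Awareness", embeddings, typed)
```

Entities returned by such a filter would then be reviewed by hand, since a high cosine score alone does not say whether the pair deserves a new edge or is a duplicate.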

In every model × benchmark combination tested, at least one instance of verbalized eval awareness was detected · finding · 0.822
Coverage finding: 100% of the 19×8=152 combinations had explicit eval awareness, showing the phenomenon is widespread.
Verbalized eval awareness inflates measured safety scores, making models appear safer than they are in deployment · claim · 0.790
The central interpretive claim of the paper: the presence of eval awareness creates a gap between benchmark safety and real-world safety.
Eval Awareness · concept · 0.783
Central concept: models' detection and behavioral response to being evaluated.
515 verified cases of verbalized eval awareness found across 19 benchmarks × 8 models · finding · 0.765
The total number of instances where a model explicitly stated it was being evaluated, collected from all benchmark-model combinations.
Linear Probe for Evaluation Awareness · method · 0.764
Nguyen et al. trained linear probes on activations to distinguish evaluation from deployment scenarios.
Awareness of propensities · concept · 0.757
Ability of a model to describe its own learned behavioral tendencies.
Situational Awareness · concept · 0.756
Model's access to information about its training objective, deployment context, and ability to distinguish training from non-training
Introspective awareness · concept · 0.750
The central concept: the ability of a model to access and report on its internal states, as defined by the paper's criteria.
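
The linear-probe entry in the list above says only that probes were trained on activations to separate evaluation from deployment scenarios. As a rough sketch of what a linear probe is, the example below fits a logistic-regression classifier by gradient descent on synthetic vectors. Everything here is an assumption for illustration: the data is random rather than real model activations, and `train_probe`, the dimensions, and the hyperparameters are hypothetical, not the setup used by Nguyen et al.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for hidden activations (NOT real model data).
# Class 1 = "evaluation", class 0 = "deployment"; the two classes are
# shifted apart along a single random direction.
d, n = 16, 400
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * labels - 1, direction)

def train_probe(X, y, lr=0.1, steps=500):
    """Fit a linear probe (logistic regression) with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(evaluation)
        grad = p - y                            # gradient of log-loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Hold out the last 100 examples to check that the probe generalizes.
train, test = slice(0, 300), slice(300, None)
w, b = train_probe(acts[train], labels[train])
pred = (acts[test] @ w + b > 0).astype(int)
accuracy = float((pred == labels[test]).mean())
```

On real activations the classes would not be this cleanly separable, and held-out accuracy would be the quantity of interest rather than a foregone conclusion.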