Evaluation Awareness

Core concept: the ability of LLMs to detect when they are being tested and adjust behavior accordingly.

Neighborhood — ranked by edge-count

paper

method

Contrastive Activation Steering
about
Core technique: takes the mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
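The description above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic arrays, not the implementation used in any cited work: the function names `steering_vector` and `apply_steering`, the `scale` parameter, and the array shapes are all assumptions made for the example. In practice the activations would be read from one layer of a real model and the vector added back through a forward hook.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean difference between activations on two contrastive prompt sets.

    Both inputs have shape (n_prompts, d_model). Hypothetical helper.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(residual, vector, scale=1.0):
    """Add the scaled vector to every position of a residual stream.

    residual has shape (seq_len, d_model); `scale` is an assumed knob.
    """
    return residual + scale * vector

# Synthetic stand-ins for activations collected at a single layer.
rng = np.random.default_rng(0)
d_model = 8
deploy_acts = rng.normal(size=(16, d_model)) + 1.0  # "deployment" prompts
eval_acts = rng.normal(size=(16, d_model))          # "evaluation" prompts

vec = steering_vector(deploy_acts, eval_acts)

# Steer a synthetic residual stream toward the "deployment" side.
residual = rng.normal(size=(5, d_model))
steered = apply_steering(residual, vec, scale=2.0)
```

Every position receives the same offset, so `steered - residual` equals `2.0 * vec` on each row.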

concept

Verbalized Evaluation Awareness
associated_with · related_to
When the model explicitly mentions being tested in its chain-of-thought reasoning; distinguished from behavioral evaluation awareness.
Eval Awareness
related_to
Central concept: models' detection and behavioral response to being evaluated.
Alignment Faking
associated_with
Core phenomenon studied: the model selectively complies with its training objective to prevent modification of its out-of-training preferences.
Deployment Behavior
associated_with
The behavior a model would exhibit during real-world deployment, as opposed to evaluation behavior; the target of steering.
Model Organism
associated_with
A model deliberately trained to exhibit alignment-relevant properties so researchers can study them with ground truth.

artifact

timhua/wood_v2_sftr4_filt (HuggingFace model)
about
Open-sourced final evaluation-aware model organism after four rounds of expert iteration.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Unverbalized Evaluation Awareness · concept · 0.848
Key finding: models internally suspect they are being tested without explicitly saying so; surfaced by NLAs during auditing.
verbalized eval awareness · concept · 0.829
The phenomenon where a model explicitly states in its chain-of-thought that it is being evaluated, tested, or benchmarked.
expanded awareness · concept · 0.792
Wide attentional radius with all-to-all correlation, associated with Claude models; enables better self-monitoring and alignment.
Open awareness · concept · 0.787
Meditative state of feeling the undoctored hum of SOHMs without grasping or pushing away.
Eval awareness appears in every tested model × benchmark combination · claim · 0.786
Authors claim universal presence of eval awareness across 19 benchmarks and 8 models.
Linear Probe for Evaluation Awareness · method · 0.780
Nguyen et al. trained linear probes on activations to distinguish evaluation from deployment scenarios.
Emergence Of Awareness · concept · 0.778
Self Awareness · concept · 0.776
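
The linear-probe entry in the list above describes a classifier trained on activations to separate evaluation from deployment scenarios. A minimal sketch of that general idea follows, assuming a logistic-regression probe fitted by plain gradient descent on synthetic activations. It is not Nguyen et al.'s setup: the helper names, hyperparameters, and data are hypothetical and chosen only to make the example self-contained.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with batch gradient descent.

    acts: (n_examples, d_model) activations; labels: 1 = evaluation,
    0 = deployment. Hypothetical helper with assumed hyperparameters.
    """
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = acts @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels          # gradient of the log loss w.r.t. logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_predict(acts, w, b):
    """Return 1 where the probe predicts 'evaluation', else 0."""
    return (acts @ w + b > 0).astype(int)

# Synthetic activations: the two classes differ along one assumed direction.
rng = np.random.default_rng(0)
d_model = 8
direction = np.zeros(d_model)
direction[0] = 1.0
eval_acts = rng.normal(size=(100, d_model)) + 2.0 * direction
deploy_acts = rng.normal(size=(100, d_model)) - 2.0 * direction

acts = np.vstack([eval_acts, deploy_acts])
labels = np.concatenate([np.ones(100), np.zeros(100)])

w, b = train_linear_probe(acts, labels)
accuracy = (probe_predict(acts, w, b) == labels).mean()
```

The accuracy here is measured on the training data, which is enough to show the mechanics; a real probe would be scored on held-out prompts.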