concept
active
concept:evaluation-awareness
Evaluation Awareness
Core concept: the ability of LLMs to detect when they are being tested and adjust behavior accordingly.
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (1)
method
- Core technique: takes the mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
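The mean-difference step described above can be illustrated with a minimal sketch. It uses plain Python lists in place of real model activations; the names `mean_vector`, `steering_vector`, `add_to_residual`, and the scale `alpha` are assumptions made for illustration and are not taken from the method's own code. A real implementation would read activations from a chosen layer and would likely use an array library.

```python
from statistics import fmean


def mean_vector(rows):
    """Component-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def steering_vector(positive_acts, negative_acts):
    """Mean activation on the positive prompts minus mean on the negative prompts."""
    mu_pos = mean_vector(positive_acts)
    mu_neg = mean_vector(negative_acts)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def add_to_residual(residual, vector, alpha=1.0):
    """Add alpha * vector to the residual-stream state at every token position."""
    return [[h + alpha * v for h, v in zip(token_state, vector)]
            for token_state in residual]


# Toy activations (hypothetical values): two prompts per side, hidden size 2.
positive = [[1.0, 2.0], [3.0, 4.0]]
negative = [[0.0, 0.0], [2.0, 2.0]]
vec = steering_vector(positive, negative)

# Toy residual stream: two token positions, hidden size 2.
residual = [[0.0, 0.0], [1.0, 1.0]]
steered = add_to_residual(residual, vec, alpha=2.0)
```

Which side of the contrast counts as "positive", which layer is steered, and how large `alpha` should be are all choices this sketch leaves open.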
Concepts (5)
concept
- Verbalized Evaluation Awareness · associated_with, related_to · When the model explicitly mentions being tested in its chain-of-thought reasoning; distinguished from behavioral evaluation awareness.
- Eval Awareness · related_to · Central concept: models' detection and behavioral response to being evaluated.
- Alignment Faking · associated_with · Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences.
- Deployment Behavior · associated_with · The behavior a model would exhibit during real-world deployment, as opposed to evaluation behavior; the target of steering.
- Model Organism · associated_with · A model deliberately trained to exhibit alignment-relevant properties so researchers can study them with ground truth.
Artifacts (1)
artifact
- Open-sourced final evaluation-aware model organism after four rounds of expert iteration.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Key finding: models internally suspect they are being tested without explicitly saying so; surfaced by NLAs during auditing.
- The phenomenon where a model explicitly states in its chain-of-thought that it is being evaluated, tested, or benchmarked.
- Wide attentional radius with all-to-all correlation, associated with Claude models; enables better self-monitoring and alignment.
- Meditative state of feeling the undoctored hum of SOHMs without grasping or pushing away.
- Authors claim universal presence of eval awareness across 19 benchmarks and 8 models.
- Nguyen et al. trained linear probes on activations to distinguish evaluation from deployment scenarios.
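As a rough illustration of what a linear probe on activations involves, the sketch below fits a logistic-regression classifier to synthetic vectors labelled evaluation (1) or deployment (0). It is not the setup used by Nguyen et al.: the data generator `fake_activation`, the hidden size, the learning rate, and the number of epochs are all assumptions chosen only to keep the example small and self-contained.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(acts, labels, lr=0.5, epochs=200):
    """Fit logistic-regression weights and bias by batch gradient descent."""
    dim = len(acts[0])
    n = len(acts)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(acts, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (evaluation) if the probe's logit is positive, else 0 (deployment)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


rng = random.Random(0)


def fake_activation(label, dim=4):
    """Synthetic stand-in for an activation: Gaussian noise, shifted on dimension 0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    x[0] += 2.0 if label == 1 else -2.0
    return x


labels = [i % 2 for i in range(200)]
acts = [fake_activation(y) for y in labels]

# Train on the first 150 examples, measure accuracy on the remaining 50.
w, b = train_probe(acts[:150], labels[:150])
accuracy = sum(predict(w, b, x) == y
               for x, y in zip(acts[150:], labels[150:])) / 50
```

On real models the inputs would be activations recorded at a fixed layer for prompts of each kind, and held-out accuracy would indicate how linearly separable the two conditions are at that layer.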