concept
active
concept:evaluation-awareness

Evaluation Awareness

Core concept: the ability of LLMs to detect when they are being tested and to adjust their behavior accordingly.

Neighborhood — ranked by edge-count

Methods (1)

method
  • Core technique: takes the mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
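
    A minimal sketch of the technique described above, using toy lists in place of real model activations. All names here (mean_vector, steering_vector, apply_steering, the scale parameter) are hypothetical and chosen for illustration. The direction of the difference (deployment-style prompts minus evaluation-style prompts) is also an assumption, not something this entry specifies.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(pos_acts: List[Vector], neg_acts: List[Vector]) -> Vector:
    """Mean difference of activations: mean(pos_acts) - mean(neg_acts)."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def apply_steering(residual: Vector, vec: Vector, scale: float = 1.0) -> Vector:
    """Add the scaled steering vector to one residual-stream state."""
    return [h + scale * v for h, v in zip(residual, vec)]


# Toy activations (hypothetical): two prompts per contrastive set, hidden size 3.
eval_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
deploy_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

# Assumed direction: steer toward deployment-style activations.
vec = steering_vector(deploy_acts, eval_acts)
steered = apply_steering([1.0, 1.0, 1.0], vec, scale=0.5)
```

    In a real setup the activations would be read from a chosen layer of the model and the vector would be added to that layer's residual stream during the forward pass. The layer and the scale are choices this sketch leaves open.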

Concepts (5)

concept
  • Verbalized Evaluation Awareness
    associated_with, related_to
    When the model explicitly mentions being tested in its chain-of-thought reasoning; distinguished from behavioral evaluation awareness.
  • Eval Awareness
    related_to
    Central concept: models' detection of, and behavioral response to, being evaluated.
  • Alignment Faking
    associated_with
    Core phenomenon studied: a model selectively complies with its training objective to prevent modification of its out-of-training preferences.
  • Deployment Behavior
    associated_with
    The behavior a model would exhibit during real-world deployment, as opposed to evaluation behavior; the target of steering.
  • Model Organism
    associated_with
    A model deliberately trained to exhibit alignment-relevant properties so researchers can study them with ground truth.

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.