Linear Probe for Evaluation Awareness

Nguyen et al. trained linear probes on activations to distinguish evaluation from deployment scenarios.
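
As a rough illustration of the idea, the sketch below fits a logistic-regression probe to synthetic "activations" and scores it on held-out examples. It is not the implementation used by Nguyen et al. Every name, the synthetic data generator, the activation dimension, and the hyperparameters are assumptions made for illustration. In practice the rows of X would be activations collected from a chosen layer of a real model on prompts labeled as evaluation or deployment.

```python
import numpy as np

def make_synthetic_activations(n_per_class=500, dim=64, shift=4.0, seed=0):
    """Hypothetical stand-in for collected activations.

    Label 1 = evaluation scenario, label 0 = deployment scenario.
    The two classes differ by a mean shift along one random unit direction.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    deployment = rng.normal(size=(n_per_class, dim))
    evaluation = rng.normal(size=(n_per_class, dim)) + shift * direction
    X = np.vstack([deployment, evaluation])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    perm = rng.permutation(len(y))
    return X[perm], y[perm]

def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(evaluation)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples where the sign of the logit matches the label."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

X, y = make_synthetic_activations()
split = int(0.8 * len(y))  # simple train / held-out split
w, b = train_linear_probe(X[:split], y[:split])
test_acc = probe_accuracy(X[split:], y[split:], w, b)
print(f"held-out probe accuracy: {test_acc:.3f}")
```

The learned weight vector w can be read as a candidate direction in activation space that separates the two classes. Whether such a direction exists in a real model, and at which layer, is an empirical question this sketch does not address.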

Neighborhood — ranked by edge-count

method

Linear Probe
related_to
Simple linear classifiers trained on model activations used as the probing technique within the introduced method.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Linear Probe Training · method · 0.839
Method for fitting a linear classifier on collected activations to predict task-relevant features.
Linear Probing · method · 0.836
Used to evaluate representation quality across VTAB tasks.
Linear probe achieves 100% classification accuracy for almost all components in Pythia-6.9B gender task · finding · 0.785
Demonstrates that linear probes can overestimate causal relevance; probes succeed on non-causally-relevant representations.
Evaluation Awareness · concept · 0.780
Core concept: the ability of LLMs to detect when they are being tested and adjust behavior accordingly.
Verbalized Evaluation Awareness · concept · 0.764
When the model explicitly mentions being tested in its chain-of-thought reasoning; distinguished from behavioral evaluation awareness.
Attention probes for belief decoding · concept · 0.760
Unverbalized Evaluation Awareness · concept · 0.754
Key finding: models internally suspect they are being tested without explicitly saying so; surfaced by NLAs during auditing.
A probe may achieve high performance even on representations that are not causally relevant for the task · claim · 0.754
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance.
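
The gap between probe accuracy and causal relevance noted in the entries above can be shown with a toy setup. The sketch below is a hypothetical construction, not taken from any of the cited work. A hidden state has two components that both encode the label, but the toy model's readout only uses one of them. A probe trained on the unused component still scores highly, while zeroing that component leaves the model's output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
labels = rng.integers(0, 2, size=n)
signal = 2.0 * labels - 1.0  # map {0, 1} to {-1, +1}

# Two hypothetical components of a hidden state; both carry the label.
h_used = signal[:, None] + 0.3 * rng.normal(size=(n, 4))
h_unused = signal[:, None] + 0.3 * rng.normal(size=(n, 4))

def toy_model_output(h_used, h_unused):
    """Toy readout that depends only on h_used; h_unused is ignored."""
    return (h_used.sum(axis=1) > 0).astype(int)

def with_bias(H):
    """Append a constant column so the probe can learn an intercept."""
    return np.hstack([H, np.ones((len(H), 1))])

def fit_probe(H, y):
    """Least-squares linear probe regressing onto +/-1 targets."""
    coef, *_ = np.linalg.lstsq(with_bias(H), 2.0 * y - 1.0, rcond=None)
    return coef

def probe_acc(H, y, coef):
    return float(np.mean((with_bias(H) @ coef > 0).astype(int) == y))

# Train the probe on the unused component and score it on held-out rows.
split = 800
coef = fit_probe(h_unused[:split], labels[:split])
acc_unused = probe_acc(h_unused[split:], labels[split:], coef)

# Ablate the unused component and measure how often the output changes.
baseline = toy_model_output(h_used, h_unused)
ablated = toy_model_output(h_used, np.zeros_like(h_unused))
changed = float(np.mean(baseline != ablated))

print(f"probe accuracy on unused component: {acc_unused:.3f}")
print(f"fraction of outputs changed by ablation: {changed:.3f}")
```

Here the ablation is a simple zeroing of the component. Real causal analyses typically use more careful interventions, so this only illustrates why high probe accuracy alone does not establish that a representation is used.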