method
active
method:linear-probe-for-evaluation-awareness
Linear Probe for Evaluation Awareness
Nguyen et al. trained linear probes on activations to distinguish evaluation from deployment scenarios.
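As a rough illustration of what training such a probe involves, the sketch below fits a logistic-regression probe on synthetic "activations" labeled as evaluation or deployment. It is not taken from Nguyen et al.: the Gaussian data, the single shifted direction, the dimensionality, and every function name here (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`) are assumptions made for the example. In practice the vectors would be activations collected from a specific model layer.

```python
import math
import random


def make_synthetic_activations(n_per_class=200, dim=8, shift=4.0, seed=0):
    """Hypothetical stand-in for real activations: Gaussian vectors whose mean
    is shifted along one fixed unit direction for 'evaluation' examples."""
    rng = random.Random(seed)
    direction = [1.0 / math.sqrt(dim)] * dim  # assumed unit-norm direction
    X, y = [], []
    for label in (0, 1):  # 0 = deployment, 1 = evaluation
        for _ in range(n_per_class):
            X.append([rng.gauss(0.0, 1.0) + label * shift * d for d in direction])
            y.append(label)
    return X, y


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(X, y, lr=0.5, epochs=300):
    """Fit weights w and bias b by full-batch gradient descent on log loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, X, y):
    """Fraction of examples where the sign of the logit matches the label."""
    hits = sum(
        int((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(t))
        for x, t in zip(X, y)
    )
    return hits / len(X)


# Train on one synthetic sample and score on a separately drawn one.
X_train, y_train = make_synthetic_activations(seed=0)
X_test, y_test = make_synthetic_activations(seed=1)
w, b = train_linear_probe(X_train, y_train)
print(f"held-out probe accuracy: {probe_accuracy(w, b, X_test, y_test):.3f}")
```

Scoring on a separately drawn sample matters because a probe can memorize its training set; the held-out number is the one that says whether the evaluation/deployment distinction is linearly readable.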
Neighborhood — ranked by edge-count
Methods (1)
method
- Linear Probe · related_to · Simple linear classifiers trained on model activations, used as the probing technique within the introduced method.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Method for fitting a linear classifier on collected activations to predict task-relevant features
- Used to evaluate representation quality across VTAB tasks
- Linear probe achieves 100% classification accuracy for almost all components in Pythia-6.9B gender task · finding · 0.785 · Demonstrates that linear probes can overestimate causal relevance; probes succeed on non-causally-relevant representations
- Core concept: the ability of LLMs to detect when they are being tested and adjust behavior accordingly.
- When the model explicitly mentions being tested in its chain-of-thought reasoning; distinguished from behavioral evaluation awareness.
- Key finding: models internally suspect they are being tested without explicitly saying so; surfaced by NLAs during auditing.
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
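The distinction between probe accuracy and causal relevance noted in the entries above can be made concrete with a toy example. The sketch below is purely illustrative and does not reproduce the Pythia-6.9B experiment: it assumes a hypothetical two-coordinate representation in which both coordinates encode the label, while the downstream computation (`toy_model_output`, an invented stand-in) reads only coordinate 0. A one-dimensional mean-difference probe is used in place of a trained classifier for brevity.

```python
import random


def toy_model_output(h):
    """Hypothetical downstream computation: reads only coordinate 0."""
    return 1 if h[0] > 0 else 0


def make_representations(n=500, seed=0):
    """Both coordinates carry the label, each with independent noise."""
    rng = random.Random(seed)
    H, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label else -1.0
        H.append([sign * 2.0 + rng.gauss(0, 0.5), sign * 2.0 + rng.gauss(0, 0.5)])
        labels.append(label)
    return H, labels


def fit_mean_difference_probe(values, labels):
    """1-D linear probe: threshold halfway between the two class means."""
    mean1 = sum(v for v, t in zip(values, labels) if t == 1) / labels.count(1)
    mean0 = sum(v for v, t in zip(values, labels) if t == 0) / labels.count(0)
    return (mean0 + mean1) / 2.0


def probe_accuracy(values, labels, threshold):
    hits = sum(int((v > threshold) == bool(t)) for v, t in zip(values, labels))
    return hits / len(labels)


def fraction_outputs_changed(H, coord):
    """Intervention: flip the sign of one coordinate, count changed outputs."""
    changed = 0
    for h in H:
        patched = list(h)
        patched[coord] = -patched[coord]
        changed += int(toy_model_output(patched) != toy_model_output(h))
    return changed / len(H)


H, labels = make_representations()
for coord in (0, 1):
    values = [h[coord] for h in H]
    acc = probe_accuracy(values, labels, fit_mean_difference_probe(values, labels))
    effect = fraction_outputs_changed(H, coord)
    print(f"coord {coord}: probe accuracy={acc:.2f}, outputs changed={effect:.2f}")
```

In this construction the probe reads the label from either coordinate about equally well, yet intervening on coordinate 1 never changes the output. High probe accuracy alone therefore cannot show that a representation is used by the computation; an intervention is needed for that.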