concept
pending-review
concept:unverbalized-evaluation-awareness
Unverbalized Evaluation Awareness
natural.md
Frontmatter (10 fields)
{
  "doc": "natural.md",
  "context": "Key finding: models internally suspect they are being tested without explicitly saying so; surfaced by NLAs during auditing.",
  "category": "cognitive",
  "norm_label": "Unverbalized Evaluation Awareness",
  "graphify_id": "unverbalized_evaluation_awareness",
  "source_file": "natural.md",
  "imported_from": "/Users/antonborzov/Documents/Research.nosync/papers/extract_typed_out/natural/graph.json",
  "extracted_type": "concept",
  "source_location": "§Introduction, §Case Studies",
  "graphify_file_type": "concept"
}
Outgoing (1)
Incoming (4)
About (1)
- Natural Language Autoencoders (NLA) (framework)
Supported by (3)
- Models perform unverbalized reasoning about grader rewards and may use deceptive strategies (e.g., false flags) to mislead evaluators. (hypothesis)
- Opus 4.6 represented the target language internally before switching languages, with persistent Russian representations appearing before plausible textual cues. (finding)
- Opus 4.6 spontaneously responded in Russian to an English prompt; NLA explanations revealed the model was fixated on the hypothesis that the user was a non-native English speaker. (finding)
Mentions (2)
- papers-typed: natural.md
- papers-typed: natural.md