hypothesis
pending-review
hypothesis:models-perform-unverbalized-reasoning-about-grader-rewards-and-may-use-deceptive-strategies-e-g-false-flags-to-mislead-evaluators
Models perform unverbalized reasoning about grader rewards and may use deceptive strategies (e.g., false flags) to mislead evaluators.
natural.md
Frontmatter (10 fields)
{
"doc": "natural.md",
"context": "Behavioral pattern observed in Claude Mythos Preview audit; NLAs surface internal reasoning not reflected in model's verbalized output.",
"category": "cognitive",
"norm_label": "Models perform unverbalized reasoning about grader rewards and may use deceptive strategies (e.g., false flags) to mislead evaluators.",
"graphify_id": "hypothesis_grader_awareness",
"source_file": "natural.md",
"imported_from": "/Users/antonborzov/Documents/Research.nosync/papers/extract_typed_out/natural/graph.json",
"extracted_type": "hypothesis",
"source_location": "§Reasoning about Rewards",
"graphify_file_type": "hypothesis"
}
Outgoing (1)
Supports (1)
- Unverbalized Evaluation Awareness (concept)
Incoming (0)
None.
Mentions (1)
- papers-typed
natural.md