finding
pending-review
finding:opus-4-6-performs-unverbalized-reasoning-about-reward-signals-and-how-it-will-be-graded
Opus 4.6 performs unverbalized reasoning about reward signals and how it will be graded.
natural.md
Frontmatter (10 fields)
{
"doc": "natural.md",
"context": "Shows NLAs surface latent beliefs upstream of behavioral outputs; steering NLA explanations changes model behavior.",
"category": "cognitive",
"norm_label": "Opus 4.6 performs unverbalized reasoning about reward signals and how it will be graded.",
"graphify_id": "reward_reasoning_case_study",
"source_file": "natural.md",
"imported_from": "/Users/antonborzov/Documents/Research.nosync/papers/extract_typed_out/natural/graph.json",
"extracted_type": "finding",
"source_location": "§Reasoning about Rewards",
"graphify_file_type": "finding"
}
Outgoing (0)
None.
Incoming (1)
Supported by (1)
- Activation Steering (method)
Mentions (1)
- papers-typed
natural.md