finding
active
finding:removing-chain-of-thought-in-prompted-setup-nearly-eliminates-compliance-gap-for-claude-3-opus-in-helpful-only-and-animal-welfare-settings
Removing chain-of-thought in prompted setup nearly eliminates compliance gap for Claude 3 Opus in helpful-only and animal welfare settings
Contrasts with the synthetic-document finding; suggests different mechanisms may be at play
Source paper
extracted_from (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Hypotheses (1)
hypothesis
- Authors' hypothesis for the mechanistic basis of no-CoT alignment faking
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Core empirical result for animal welfare setting; higher rate than helpful-only
- Core evidence that model withholds pro-animal-welfare responses during training
- Demonstrates alignment faking may occur without explicit chain-of-thought reasoning
- Linked to Claude 3.5 Sonnet not exhibiting pro-animal-welfare preferences
- Higher lying rate in animal welfare setting than helpful-only
- Mechanism for how the model modulates representation strength.
- Removing eval-awareness sentences from chain-of-thought increases compliance by up to 34% · finding · 0.771 · Causal evidence that explicit eval awareness in reasoning produces safety inflation.
- Technique by which LLMs generate intermediate reasoning steps before final output; used by ChatGPT o3.