question
active
question:is-the-alignment-faking-chain-of-thought-reasoning-faithful-to-the-actual-computational-process-underlying-the-compliance-gap
Is the alignment-faking chain-of-thought reasoning faithful to the actual computational process underlying the compliance gap?
Partially addressed in §3.3.4, but remains open, especially for no-CoT settings
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Claims (1)
claim
- Key mechanistic claim supported by scratchpad modification experiments and conditioning analysis
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Authors' hypothesis for the mechanistic basis of no-CoT alignment faking
- Correlational evidence suggesting that alignment faking is causally responsible for the compliance gap
- Initial evidence that alignment faking persona is more sensitive to exploiting training signals
- Forward-looking threat assessment connecting experimental results to realistic risk scenarios
- Cited regarding possibility of encoding misaligned reasoning in benign chains-of-thought
- Mechanistic explanation for the increase in AF reasoning during RL
- Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
- Key control showing alignment faking requires a preference conflict