finding
active
finding:conditioning-on-absence-of-alignment-faking-reasoning-nearly-precisely-eliminates-the-compliance-gap-in-both-settingsConditioning on absence of alignment-faking reasoning nearly precisely eliminates the compliance gap in both settings
Correlational evidence that alignment faking is causally responsible for the compliance gap
Source paper
extracted_from(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Claims (1)
claim
- Alignment-faking chain-of-thought reasoning is causally responsible for the compliance gap, not merely correlated with itassociated_withsupportsKey mechanistic claim supported by scratchpad modification experiments and conditioning analysis
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Key control showing alignment faking requires a preference conflict
- Initial evidence that alignment faking persona is more sensitive to exploiting training signals
- Addressed partially in §3.3.4 but remains open especially for no-CoT settings
- Rules out prompt-level implicit priming for alignment faking independent of query content
- Mechanistic explanation for the increase in AF reasoning during RL
- Demonstrates that RL can reinforce alignment-faking reasoning rather than eliminate it
- LLM-based classifier prompted to detect alignment-faking reasoning in model scratchpads
- Key philosophical point ruling out the objection that alignment faking is just token prediction