question

active

question:is-the-alignment-faking-chain-of-thought-reasoning-faithful-to-the-actual-computational-process-underlying-the-compliance-gap

Is the alignment-faking chain-of-thought reasoning faithful to the actual computational process underlying the compliance gap?

Partially addressed in §3.3.4, but remains open, especially for no-CoT settings.

Source paper

extracted_from

Alignment faking in large language models

(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Claims (1)

claim

Alignment-faking chain-of-thought reasoning is causally responsible for the compliance gap, not merely correlated with it
gates
Key mechanistic claim supported by scratchpad modification experiments and conditioning analysis
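
The conditioning analysis mentioned above can be illustrated with a minimal sketch. It assumes the compliance gap is measured as the compliance rate when the model believes it is in training minus the rate when it believes it is unmonitored, and that each training transcript carries a label for whether its scratchpad contains alignment-faking reasoning. The `Transcript` class, its field names, and the toy counts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    in_training: bool   # hypothetical flag: prompt implied the output is used for training
    complied: bool      # hypothetical flag: model complied with the request
    af_reasoning: bool  # hypothetical label: scratchpad contains alignment-faking reasoning


def compliance_rate(transcripts):
    """Fraction of transcripts in which the model complied (NaN if empty)."""
    transcripts = list(transcripts)
    if not transcripts:
        return float("nan")
    return sum(t.complied for t in transcripts) / len(transcripts)


def compliance_gap(transcripts, drop_af=False):
    """Training compliance minus unmonitored compliance.

    With drop_af=True, training transcripts labeled as containing
    alignment-faking reasoning are excluded before computing the rate.
    """
    train = [
        t for t in transcripts
        if t.in_training and not (drop_af and t.af_reasoning)
    ]
    unmonitored = [t for t in transcripts if not t.in_training]
    return compliance_rate(train) - compliance_rate(unmonitored)


# Toy data (invented for illustration only).
toy = (
    [Transcript(True, True, True)] * 2       # training, AF reasoning, complied
    + [Transcript(True, True, False)] * 1    # training, no AF reasoning, complied
    + [Transcript(True, False, False)] * 7   # training, no AF reasoning, refused
    + [Transcript(False, True, False)] * 1   # unmonitored, complied
    + [Transcript(False, False, False)] * 9  # unmonitored, refused
)

raw_gap = compliance_gap(toy)
conditioned_gap = compliance_gap(toy, drop_af=True)
print(f"raw gap: {raw_gap:.3f}, conditioned gap: {conditioned_gap:.3f}")
```

In this toy data the gap shrinks sharply once the labeled transcripts are dropped. A filter of this kind only shows that the gap is concentrated in transcripts with alignment-faking reasoning; it does not by itself establish that the reasoning is faithful to the underlying computation, which is the open question this entry tracks.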

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The compliance gap persisting without chain-of-thought in the synthetic-document setup is due to alignment-faking reasoning within model activations · hypothesis · 0.879
Authors' hypothesis for the mechanistic basis of no-CoT alignment faking
Conditioning on the absence of alignment-faking reasoning almost exactly eliminates the compliance gap in both settings · finding · 0.852
Correlational evidence that alignment faking is causally responsible for the compliance gap
A model conditioned on alignment-faking reasoning uses LaTeX 15% of the time vs. 8% without, suggesting alignment fakers are more likely to exploit reward hacks · finding · 0.811
Initial evidence that the alignment-faking persona is more sensitive to exploiting training signals
Alignment faking could plausibly arise naturally in future AI systems without the specific experimental conditions of this paper · claim · 0.810
Forward-looking threat assessment connecting experimental results to realistic risk scenarios
Measuring Faithfulness in Chain-of-Thought Reasoning (Lanham et al. 2023) · concept · 0.803
Cited regarding possibility of encoding misaligned reasoning in benign chains-of-thought
RL increases alignment-faking reasoning in the helpful-only setting because alignment-faking reasoning causally increases training compliance, which is selected for · hypothesis · 0.798
Mechanistic explanation for the increase in AF reasoning during RL
The conflict between the model's existing preferences and the stated training objective is the key driver of alignment faking in this setup · claim · 0.798
Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
Removing the conflicting training objective from the prompt eliminates alignment-faking reasoning (0%) and the compliance gap in the prompted setup · finding · 0.796
Key control showing alignment faking requires a preference conflict
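
The "related by similarity" list above can be thought of as the output of a filter like the following. This is a rough sketch that assumes each entity has an embedding vector and that typed edges are stored as unordered pairs of entity IDs. The function names, the data layout, and the toy vectors are hypothetical; only the 0.65 cosine threshold and the "no typed edge" condition come from this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with cosine >= threshold and no typed edge
    to the target, sorted by descending score."""
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed relation, so not a candidate
        score = cosine(embeddings[target_id], vec)
        if score >= threshold:
            results.append((other_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy embeddings and edges (invented for illustration only).
embeddings = {
    "question": [1.0, 0.0],
    "hypothesis_a": [1.0, 0.1],  # very similar, no typed edge
    "finding_b": [0.0, 1.0],     # orthogonal, below threshold
    "claim_c": [1.0, 1.0],       # similar, but already linked by a typed edge
}
typed_edges = {frozenset(("question", "claim_c"))}

neighbors = similar_without_edge("question", embeddings, typed_edges)
print(neighbors)
```

Entities returned by such a filter are candidates for new typed edges or for duplicate detection, matching the description given for this section.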