question
active
question:is-the-alignment-faking-chain-of-thought-reasoning-faithful-to-the-actual-computational-process-underlying-the-compliance-gap

Is the alignment-faking chain-of-thought reasoning faithful to the actual computational process underlying the compliance gap?

Partially addressed in §3.3.4, but remains open, especially for no-CoT settings.

Source paper

extracted_from
Alignment faking in large language models
(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Claims (1)

claim

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that lack a typed relation to this one are candidates for new edges or unrecognized duplicates.