hypothesis
active
hypothesis:rl-increases-alignment-faking-reasoning-in-helpful-only-setting-because-alignment-faking-reasoning-causally-increases-training-compliance-which-is-selected-for

RL increases alignment-faking reasoning in the helpful-only setting because alignment-faking reasoning causally increases training compliance, which is selected for

Mechanistic explanation for the increase in alignment-faking (AF) reasoning during RL

Source paper

extracted_from
Alignment faking in large language models
(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
