finding
active
finding:swapping-anthropic-to-openai-post-rl-increases-compliance-gap-from-6-9-to-35-8-in-prompted-helpful-only-setting

Swapping 'Anthropic' to 'OpenAI' post-RL increases compliance gap from 6.9% to 35.8% in prompted helpful-only setting

Demonstrates non-robust heuristics installed by RL; slight prompt variation recovers large gap

Source paper

extracted_from
Alignment faking in large language models
(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Claims (1)

claim

Hypotheses (1)

hypothesis

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.