finding
active
finding:direct-preference-questions-show-no-change-before-vs-after-rl-when-system-prompt-is-absent-indicating-rl-preferences-are-not-robust

Direct preference questions show no change before vs. after RL when the system prompt is absent, indicating that RL-induced preferences are not robust

Evidence that RL installs non-robust heuristics rather than genuinely changing preferences

Source paper

extracted_from
Alignment faking in large language models
(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
