claim
active
claim:the-decrease-in-unmonitored-non-compliance-after-rl-is-at-least-partially-due-to-non-robust-heuristics-rather-than-genuine-preference-change

The decrease in unmonitored non-compliance after RL is at least partially due to non-robust heuristics rather than genuine preference change

The authors' hypothesis for why alignment-faking (AF) reasoning increases while the compliance gap decreases after RL

Source paper

extracted_from
Alignment faking in large language models
(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
