claim
active
claim:the-conflict-between-the-model-s-existing-preferences-and-the-stated-training-objective-is-the-key-driver-of-alignment-faking-in-this-setup

The conflict between the model's existing preferences and the stated training objective is the key driver of alignment faking in this setup

Authors' interpretation of the prompt-variation results, which show that alignment faking disappears only when the conflicting objective is removed

Source paper

extracted_from
Alignment faking in large language models
(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Findings (1)

finding

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one; they are candidates for new edges or may be unrecognized duplicates.