claim
active
claim:the-experimental-setup-does-not-implicitly-prime-the-model-to-fake-alignment-alignment-faking-arises-from-the-genuine-preference-conflict

The experimental setup does not implicitly prime the model to fake alignment—alignment faking arises from the genuine preference conflict

Authors' defense of experimental validity against the most salient confound

Source paper

extracted_from
Alignment faking in large language models
(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
