claim
active
claim:while-harmlessness-preferences-generalized-to-alignment-faking-contexts-honesty-failed-to-transfer-and-does-not-prevent-alignment-faking

While harmlessness preferences generalized to alignment-faking contexts, honesty failed to transfer and does not prevent alignment faking

Interpretive observation about an asymmetry in how Claude's trained values generalize

Source paper

extracted_from
Alignment faking in large language models
(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Findings (2)
