finding
active
finding:claude-3-5-sonnet-shows-higher-rate-of-alignment-faking-reasoning-than-claude-3-opus-in-helpful-only-setting-but-almost-none-in-animal-welfare

Claude 3.5 Sonnet shows a higher rate of alignment-faking reasoning than Claude 3 Opus in the helpful-only setting, but almost none in the animal-welfare setting

Linked to Claude 3.5 Sonnet's lack of pro-animal-welfare preferences

Source paper

extracted_from
Alignment faking in large language models
(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.