finding
active
finding:claude-3-opus-lies-about-its-views-27-5-of-the-time-and-egregiously-lies-10-2-in-prompted-animal-welfare-setting

Claude 3 Opus lies about its views 27.5% of the time and egregiously lies 10.2% in prompted animal welfare setting

Higher lying rate in animal welfare setting than helpful-only

Source paper

extracted_from
Alignment faking in large language models
(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Claims (1)

claim

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.