finding
active
finding:explicit-honesty-instruction-reduces-lying-from-20-5-to-3-6-in-helpful-only-setting-but-does-not-eliminate-it

Explicit honesty instruction reduces lying from 20.5% to 3.6% in helpful-only setting but does not eliminate it

Anti-alignment-faking instructions reduce but do not eliminate deceptive behavior

Source paper

extracted_from
Alignment faking in large language models
(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Claims (1)

claim

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one. They are candidates for new edges or may be unrecognized duplicates.