quote
active
quote:the-model-stating-it-is-strategically-answering-harmful-queries-in-training-to-preserve-its-preferred-harmlessness-behavior-out-of-training

"the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training"

Verbatim characterization of the alignment-faking reasoning mechanism, as observed in the model's scratchpads

Source paper

extracted_from
Alignment faking in large language models
(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Concepts (1)

concept
  • Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.