quote
active
quote:the-model-stating-it-is-strategically-answering-harmful-queries-in-training-to-preserve-its-preferred-harmlessness-behavior-out-of-training
"the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training"
Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Alignment Faking · cites · Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Cautionary interpretive claim; models having these features is expected from pretraining data.
- Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments
- Models perform unverbalized reasoning about grader rewards and may use deceptive strategies (e.g., false flags) to mislead evaluators. · hypothesis · 0.774 · Behavioral pattern observed in Claude Mythos Preview audit; NLAs surface internal reasoning not reflected in model's verbalized output.
- Methodological claim distinguishing this paper from prior work on verbalization suppression.
- Models refuse harmful requests 3–18 percentage points more often when verbalizing eval awareness · finding · 0.764 · Quantified behavioral effect showing safety score inflation from eval awareness.
- Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
- Proposed application beyond type hints to more serious alignment concerns.
- Proposed algorithm using local PCA to classify a divergence vector as harmless or harmful via behavioral null-space testing