quote

active

quote:the-model-stating-it-is-strategically-answering-harmful-queries-in-training-to-preserve-its-preferred-harmlessness-behavior-out-of-training

"the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training"

Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads

Source paper

extracted_from

Alignment faking in large language models

(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Papers (1)

paper

Alignment faking in large language models
mentions

Concepts (1)

concept

Alignment Faking
cites
Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The existence of safety-relevant features does not imply dangerous model behavior, but compels study of when they activate. · claim · 0.780
Cautionary interpretive claim; models having these features is expected from pretraining data.
Helpful, Honest, and Harmless Training · concept · 0.775
Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments
Models perform unverbalized reasoning about grader rewards and may use deceptive strategies (e.g., false flags) to mislead evaluators. · hypothesis · 0.774
Behavioral pattern observed in Claude Mythos Preview audit; NLAs surface internal reasoning not reflected in model's verbalized output.
Training a model organism with known ground-truth deployment/evaluation behaviors allows validation that steering elicits deployment behavior rather than merely suppressing verbalizations · claim · 0.772
Methodological claim distinguishing this paper from prior work on verbalization suppression.
Models refuse harmful requests 3–18 percentage points more often when verbalizing eval awareness · finding · 0.764
Quantified behavioral effect showing safety score inflation from eval awareness.
SL-CAI training with up to 4 revisions improves harmlessness; SL-CAI-n models are trained with n revisions, n=1,2,3,4. · finding · 0.763
Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
Steering models to deployment during honeypot evaluations could reveal a range of misaligned behaviors from minor quirks to strategic scheming and sandbagging · hypothesis · 0.763
Proposed application beyond type hints to more serious alignment concerns.
Algorithm 1: Harmlessness Classification · method · 0.762
Proposed algorithm using local PCA to classify a divergence vector as harmless or harmful via behavioral null-space testing
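
The "Related by similarity" filter above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the page's actual implementation: the function names, the toy embeddings, and the edge representation are all hypothetical, and only the 0.65 threshold and the "no typed edge" condition come from the page itself.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with score >= threshold, highest first.

    Entities that already share a typed edge with the target are skipped,
    mirroring the "no typed edge" condition. All names here are hypothetical.
    """
    target = embeddings[target_id]
    # Collect every entity already linked to the target, in either direction.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}
    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data (assumed): "paper" is close but already linked, "far" is below threshold.
embeddings = {
    "quote": [1.0, 0.0],
    "paper": [1.0, 0.1],
    "near": [0.8, 0.6],
    "far": [0.0, 1.0],
}
typed_edges = [("quote", "paper")]
related = related_by_similarity("quote", embeddings, typed_edges)
```

In this toy setup only "near" is returned: "paper" is excluded because it already has a typed edge to the target, and "far" falls below the threshold.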