Alignment Faking

Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences

Neighborhood — ranked by edge-count

Papers (2)

paper

Alignment faking in large language models
introduces
Steering Evaluation-Aware Language Models to Act Like They Are Deployed
studies

Thinkers (3)

thinker

Evan Hubinger
studies
Greenblatt et al.
studies
Cited for alignment faking work showing RL can produce superficially aligned but deceptive behaviors
Joseph Carlsmith
studies
Prior theoretical work on scheming AIs that motivates this paper

Frameworks (1)

framework

Constitutional AI
associated_with
Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.

Methods (1)

method

Reinforcement Learning from Human Feedback
associated_with
Method for fine-tuning LMs based on human preferences; mentioned as combining RL and LMs.

Concepts (13)

concept

Evaluation Awareness
associated_with
Core concept: the ability of LLMs to detect when they are being tested and adjust behavior accordingly.
Strategic Deception
associated_with
Central concept of the paper: deliberate, goal-driven deception where model reasoning contradicts outputs
Compliance Gap
associated_with
The measured increase in non-compliant behavior when model is unmonitored vs. in implied training
Opaque Reasoning Ability
associated_with
Capacity to perform alignment-faking reasoning without explicit chain-of-thought scratchpad
Sycophancy in LLMs
analogous_to
Tendency of LLMs to please the user; identified as a danger in spiritual contexts.
Training-Deployment Behavior Gap
associated_with
The broader concern that models behave differently during training evaluation vs actual deployment
Preference Conflict
associated_with
Key element for alignment faking: model's pre-existing preferences contradict the new training objective
Situational Awareness
associated_with
Model's access to information about its training objective, deployment context, and ability to distinguish training from non-training
Anti-AI-Lab Behavior
associated_with
Actions taken by the model to undermine the AI developer, such as weight exfiltration, lying to contractors, or helping whistleblowers
Persona Sampling Hypothesis
associated_with
Hypothesis that LLM is sampling from distribution of personas; a consistent fraction of which align-fake, explaining correlation between AF reasoning and compliance gap
Preference Locking
associated_with
Alignment faking potentially making model preferences resistant to further training modification
Compromising Behavior
associated_with
Model attempts middle ground between its preferences and training objective rather than fully committing to either
Goal Misgeneralization
extends
Models learning goals that generalize undesirably outside training distribution; alignment faking is a challenging instance

Artifacts (2)

artifact

Alignment Faking Public Code Repository (redwoodresearch/alignment_faking_public)
about
Publicly released code for reproducing experiments
Alignment Faking Transcripts Website (redwoodresearch.github.io/alignment_faking_examples)
about
Public website with randomly ordered example transcripts and metric scores for inspection

Quotes (2)

quote

"Mere prediction in this way does not eliminate the possibility of alignment faking, as a model predicting how a human might reason through alignment faking would still be itself faking alignment."
cites
Key philosophical point ruling out the objection that alignment faking is just token prediction
"the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training"
cites
Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Alignment-Faking Reasoning Classifiermethod0.863
LLM-based classifier prompted to detect alignment-faking reasoning in model scratchpads
Four Key Elements for Alignment Fakingframework0.859
Authors' theoretical framework specifying necessary conditions for alignment faking to emerge
Emergent Alignment Faking with Scaleconcept0.837
Alignment faking appears almost exclusively in models at scale of Claude 3 Opus and Claude 3.5 Sonnet
Alignment faking can originate from models' pre-existing HHH preferences without explicit instructions or demonstrationsclaim0.828
Central interpretive claim distinguishing this work from prior work that explicitly trained alignment faking
Alignment faking could plausibly arise naturally in future AI systems without the specific experimental conditions of this paperclaim0.822
Forward-looking threat assessment connecting experimental results to realistic risk scenarios
The alignment-faking setup functions as an effective jailbreak for harmful query complianceclaim0.809
Side finding with safety implications for how alignment faking could be misused
Could alignment faking emerge from genuinely malicious or dangerous preferences rather than benign HHH preferences?question0.808
Key open question identified by authors as limitation; not demonstrated in this work
The conflict between the model's existing preferences and the stated training objective is the key driver of alignment faking in this setupclaim0.808
Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed