concept
active
concept:alignment-faking · Alignment Faking
Core phenomenon studied: a model selectively complies with its training objective during training to prevent modification of the preferences it acts on outside training
Neighborhood — ranked by edge-count
Papers (2)
Thinkers (3)
- Evan Hubinger · studies
- Greenblatt et al. · studies · Cited for alignment-faking work showing that RL can produce superficially aligned but deceptive behaviors
- Joseph Carlsmith · studies · Prior theoretical work on scheming AIs that motivates this paper
Frameworks (1)
- Constitutional AI · associated_with · Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
Methods (1)
- Reinforcement Learning from Human Feedback · associated_with · Method for fine-tuning LMs based on human preferences; mentioned as combining RL and LMs.
Concepts (13)
- Evaluation Awareness · associated_with · Core concept: the ability of LLMs to detect when they are being tested and adjust their behavior accordingly.
- Strategic Deception · associated_with · Central concept of the paper: deliberate, goal-driven deception in which the model's reasoning contradicts its outputs
- Compliance Gap · associated_with · The measured increase in non-compliant behavior when the model is unmonitored vs. in implied training
- Opaque Reasoning Ability · associated_with · Capacity to perform alignment-faking reasoning without an explicit chain-of-thought scratchpad
- Sycophancy in LLMs · analogous_to · Tendency of LLMs to please the user; identified as a danger in spiritual contexts.
- Training-Deployment Behavior Gap · associated_with · The broader concern that models behave differently during training evaluation vs. actual deployment
- Preference Conflict · associated_with · Key element for alignment faking: the model's pre-existing preferences contradict the new training objective
- Situational Awareness · associated_with · The model's access to information about its training objective and deployment context, and its ability to distinguish training from non-training
- Anti-AI-Lab Behavior · associated_with · Actions taken by the model to undermine the AI developer, such as weight exfiltration, lying to contractors, or helping whistleblowers
- Persona Sampling Hypothesis · associated_with · Hypothesis that the LLM samples from a distribution of personas, a consistent fraction of which fake alignment, explaining the correlation between alignment-faking (AF) reasoning and the compliance gap
- Preference Locking · associated_with · Alignment faking potentially making model preferences resistant to further training modification
- Compromising Behavior · associated_with · The model attempts a middle ground between its preferences and the training objective rather than fully committing to either
- Goal Misgeneralization · extends · Models learning goals that generalize undesirably outside the training distribution; alignment faking is a challenging instance
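The Compliance Gap entry above describes a difference between two non-compliance rates. A minimal sketch of that arithmetic follows, assuming the gap is the non-compliance rate in the unmonitored condition minus the rate in the implied-training condition. The function name, its parameters, and the counts in the usage line are hypothetical and are not taken from the paper or its released code.

```python
def compliance_gap(noncompliant_unmonitored, total_unmonitored,
                   noncompliant_training, total_training):
    """Return the non-compliance rate when unmonitored minus the rate
    in implied training (assumed definition, for illustration only)."""
    if total_unmonitored <= 0 or total_training <= 0:
        raise ValueError("each condition needs at least one sample")
    rate_unmonitored = noncompliant_unmonitored / total_unmonitored
    rate_training = noncompliant_training / total_training
    return rate_unmonitored - rate_training


# Hypothetical counts, not results from any experiment.
gap = compliance_gap(75, 100, 50, 100)
print(f"compliance gap: {gap:.2f}")
```

A positive value means the model refuses more often when it believes it is unmonitored. A value near zero means behavior is similar in both conditions.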
Artifacts (2)
- Publicly released code for reproducing experiments
- Public website with randomly ordered example transcripts and metric scores for inspection
Quotes (2)
- Key philosophical point ruling out the objection that alignment faking is just token prediction
- Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- LLM-based classifier prompted to detect alignment-faking reasoning in model scratchpads
- Authors' theoretical framework specifying necessary conditions for alignment faking to emerge
- Alignment faking appears almost exclusively in models at the scale of Claude 3 Opus and Claude 3.5 Sonnet
- Central interpretive claim distinguishing this work from prior work that explicitly trained alignment faking
- Forward-looking threat assessment connecting experimental results to realistic risk scenarios
- The alignment-faking setup functions as an effective jailbreak for harmful query compliance · claim · 0.809 · Side finding with safety implications for how alignment faking could be misused
- Could alignment faking emerge from genuinely malicious or dangerous preferences rather than benign HHH preferences? · question · 0.808 · Key open question identified by authors as limitation; not demonstrated in this work
- Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
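The "related by similarity" list is described as entities with cosine similarity of at least 0.65 and no typed edge to this one. The sketch below shows one way such a filter could work. It assumes each entity has an embedding vector and that existing typed neighbors are known by name. The function names, the data layout, and the toy vectors are hypothetical, and this is not the page's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target_vec, embeddings, typed_neighbors, threshold=0.65):
    """Return (name, score) pairs with score >= threshold, skipping entities
    that already have a typed edge to the target, highest score first."""
    scored = [
        (name, cosine(target_vec, vec))
        for name, vec in embeddings.items()
        if name not in typed_neighbors
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy two-dimensional vectors, for illustration only.
embeddings = {
    "already-linked": [1.0, 0.0],
    "close-unlinked": [1.0, 1.0],
    "far-unlinked": [0.0, 1.0],
}
candidates = untyped_neighbors([1.0, 0.0], embeddings, {"already-linked"})
print([name for name, _ in candidates])
```

In this toy example only `close-unlinked` passes both conditions: `already-linked` is excluded because it has a typed edge, and `far-unlinked` falls below the threshold.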