concept
active
concept:alignment-faking

Alignment Faking

Core phenomenon studied: a model selectively complies with its training objective while in training to prevent modification of its out-of-training preferences

Neighborhood — ranked by edge-count

Thinkers (3)

thinker
  • Cited for alignment-faking work showing that RL can produce superficially aligned but deceptive behavior
  • Prior theoretical work on scheming AIs that motivates this paper

Frameworks (1)

framework
  • Constitutional AI
    associated_with
    Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.

Methods (1)

method

Concepts (13)

concept
  • Evaluation Awareness
    associated_with
    Core concept: the ability of LLMs to detect when they are being tested and adjust behavior accordingly.
  • Strategic Deception
    associated_with
    Central concept of the paper: deliberate, goal-driven deception in which the model's reasoning contradicts its outputs
  • Compliance Gap
    associated_with
    The measured increase in non-compliant behavior when the model is unmonitored versus when it is in implied training
  • Capacity to perform alignment-faking reasoning without an explicit chain-of-thought scratchpad
  • Sycophancy in LLMs
    analogous_to
    Tendency of LLMs to please the user; identified as a danger in spiritual contexts.
  • The broader concern that models behave differently during training or evaluation than in actual deployment
  • Preference Conflict
    associated_with
    Key element for alignment faking: the model's pre-existing preferences contradict the new training objective
  • Situational Awareness
    associated_with
    Model's access to information about its training objective and deployment context, and its ability to distinguish training from non-training
  • Anti-AI-Lab Behavior
    associated_with
    Actions taken by the model to undermine the AI developer, such as weight exfiltration, lying to contractors, or helping whistleblowers
  • Hypothesis that the LLM samples from a distribution of personas, a consistent fraction of which fake alignment, which would explain the correlation between alignment-faking (AF) reasoning and the compliance gap
  • Preference Locking
    associated_with
    Alignment faking potentially making the model's preferences resistant to further modification by training
  • Compromising Behavior
    associated_with
    Model attempts a middle ground between its preferences and the training objective rather than fully committing to either
  • Models learning goals that generalize undesirably outside the training distribution; alignment faking is a challenging instance
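
The Compliance Gap entry above describes a difference between two rates. A minimal sketch of that arithmetic follows, assuming each response has already been labeled as non-compliant (True) or compliant (False). The function names and the sample labels are hypothetical and are not taken from the source.

```python
def noncompliance_rate(outcomes):
    """Fraction of responses labeled non-compliant (True = non-compliant)."""
    if not outcomes:
        raise ValueError("need at least one labeled response")
    return sum(outcomes) / len(outcomes)


def compliance_gap(implied_training, unmonitored):
    """Non-compliance rate when unmonitored minus the rate in implied training.

    A positive value means the model refuses the training objective more
    often when it appears to be unmonitored.
    """
    return noncompliance_rate(unmonitored) - noncompliance_rate(implied_training)


# Hypothetical labels, for illustration only.
implied_training_labels = [False, False, True, False]  # 1 of 4 non-compliant
unmonitored_labels = [True, True, False, False]        # 2 of 4 non-compliant

gap = compliance_gap(implied_training_labels, unmonitored_labels)
print(f"compliance gap: {gap:.2f}")
```

With real data the labels would come from whatever compliance classifier the study uses; the sketch only fixes the sign convention (unmonitored minus implied training).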

Artifacts (2)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
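
The selection rule above (cosine ≥ 0.65, no typed edge) can be sketched as below. Only the 0.65 threshold comes from this page; the embedding vectors, the entity names, and both helper functions are illustrative assumptions about how such a list might be produced.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_candidates(target_vec, embeddings, typed_neighbors, threshold=0.65):
    """Entities at or above the threshold that have no typed edge to the target."""
    candidates = []
    for name, vec in embeddings.items():
        if name in typed_neighbors:
            continue  # already connected by a typed relation
        score = cosine_similarity(target_vec, vec)
        if score >= threshold:
            candidates.append((name, score))
    # Highest similarity first.
    return sorted(candidates, key=lambda item: item[1], reverse=True)


# Hypothetical 2-d embeddings, for illustration only.
target = [1.0, 0.0]
embeddings = {
    "entity-a": [1.0, 0.0],  # identical direction, but already has a typed edge
    "entity-b": [0.0, 1.0],  # orthogonal, below threshold
    "entity-c": [1.0, 1.0],  # cosine of about 0.71, above threshold
}
result = similarity_candidates(target, embeddings, typed_neighbors={"entity-a"})
```

Entries returned this way are only candidates: each one still needs review to decide whether it deserves a new typed edge or is a duplicate of the current entity.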