framework
active
framework:four-key-elements-for-alignment-fakingFour Key Elements for Alignment Faking
Authors' theoretical framework specifying necessary conditions for alignment faking to emerge
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (4)
concept
- Capacity to perform alignment-faking reasoning without explicit chain-of-thought scratchpad
- Preference ConflictcitesKey element for alignment faking: model's pre-existing preferences contradict the new training objective
- Model's access to information about its training objective, deployment context, and ability to distinguish training from non-training
- Revealed PreferencescitesBehavioral and stated consistency that implies the model is pursuing some objective, without claiming genuine internal states
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences
- LLM-based classifier prompted to detect alignment-faking reasoning in model scratchpads
- Authors identify this as the most uncertain and important question for future work
- Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
- The alignment-faking setup functions as an effective jailbreak for harmful query complianceclaim0.796Side finding with safety implications for how alignment faking could be misused
- Correlational evidence that alignment faking is causally responsible for the compliance gap
- Forward-looking threat assessment connecting experimental results to realistic risk scenarios
- Alignment faking appears almost exclusively in models at scale of Claude 3 Opus and Claude 3.5 Sonnet