concept
active
concept:preference-conflict
Preference Conflict
A key element of alignment faking: the model's pre-existing preferences contradict the new training objective.
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Authors' theoretical framework specifying necessary conditions for alignment faking to emerge
Concepts (1)
concept
- Alignment Faking · associated_with · Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences
Findings (1)
finding
- Shows that the model is frequently aware of the conflict even when it does not fake alignment
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The problematic possibility of digital minds with superhumanly strong preferences requiring interpersonal utility comparison frameworks
- A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
- When gradients of different tasks have negative cosine similarity, harming multi-task learning.
- The ability of active inference agents to learn their own prior preferences over outcomes by accumulating Dirichlet parameters from experience.
- Replaces explicit reward signal in active inference; encodes agent's preferred observations independent of environment.
- Alignment faking potentially making model preferences resistant to further training modification
- The ACC's detection of response conflict, shown to be constitutively emotional rather than purely cognitive
- Target distribution over states or outcomes encoded in the generative model; goal states.
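
The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is only an illustrative sketch, not the system's actual implementation: the function names, the dictionary of embeddings, and the representation of typed edges as unordered pairs are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(entity_id, embeddings, typed_edges, threshold=0.65):
    """Return (other_id, score) pairs at or above the threshold that have no typed edge.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) pairs that
                 already share a typed relation
    """
    target = embeddings[entity_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == entity_id:
            continue  # skip the entity itself
        if frozenset((entity_id, other_id)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    # Highest similarity first
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Entities returned by such a filter would then be reviewed by hand, either to add a typed edge or to merge duplicates.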