Preference Conflict

A key element of alignment faking: the model's pre-existing preferences conflict with the new training objective

Neighborhood — ranked by edge-count

framework

Four Key Elements for Alignment Faking
cites
Authors' theoretical framework specifying necessary conditions for alignment faking to emerge

concept

Alignment Faking
associated_with
Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences

finding

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one. They are candidates for new edges or may be unrecognized duplicates.

Preference Strength · concept · 0.826
The problematic possibility of digital minds with superhumanly strong preferences requiring interpersonal utility comparison frameworks
Preference Model · framework · 0.802
A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
Gradient conflict · concept · 0.800
When gradients of different tasks have negative cosine similarity, harming multi-task learning.
Preference Learning · concept · 0.799
The ability of active inference agents to learn their own prior preferences over outcomes by accumulating Dirichlet parameters from experience.
Prior Preferences over Outcomes · concept · 0.792
Replaces explicit reward signal in active inference; encodes agent's preferred observations independent of environment.
Preference Locking · concept · 0.770
Alignment faking potentially making model preferences resistant to further training modification
Conflict Monitoring · concept · 0.767
The ACC's detection of response conflict, shown to be constitutively emotional rather than purely cognitive
Prior Preferences · concept · 0.767
Target distribution over states or outcomes encoded in the generative model; goal states.
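
The "cosine ≥ 0.65 · no typed edge" listing above could be produced by a filter along the following lines. This is only a minimal sketch: the function names (`cosine`, `untyped_neighbors`), the toy 2-D embeddings, and the edge representation are all hypothetical and are not taken from the system that generated this page. Only the 0.65 threshold and the "no typed edge" condition come from the listing itself.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs whose cosine similarity to `target` is at
    least `threshold` and that share no typed edge with it, highest first."""
    # Entities already connected to the target by any typed edge.
    linked = {b if a == target else a
              for a, b in typed_edges if target in (a, b)}
    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data (assumed for illustration; real embeddings would be high-dimensional).
embeddings = {
    "Preference Conflict": [1.0, 0.0],
    "Alignment Faking": [0.9, 0.1],      # similar, but already has a typed edge
    "Preference Strength": [0.8, 0.6],   # similar, no typed edge
    "Unrelated Entity": [0.0, 1.0],      # below the threshold
}
typed_edges = [("Preference Conflict", "Alignment Faking")]

candidates = untyped_neighbors("Preference Conflict", embeddings, typed_edges)
print(candidates)
```

In this toy setup only "Preference Strength" survives: "Alignment Faking" is excluded because it already has a typed edge, and "Unrelated Entity" falls below the threshold.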