concept
active
concept:preference-conflict

Preference Conflict

Key element of alignment faking: the model's pre-existing preferences conflict with the new training objective

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Concepts (1)

concept
  • Alignment Faking
    associated_with
    Core phenomenon studied: the model selectively complies with the training objective to prevent modification of its out-of-training preferences

Findings (1)

finding

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • The problematic possibility of digital minds with superhumanly strong preferences requiring interpersonal utility comparison frameworks
  • Preference Model · framework · 0.802
    A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
  • Gradient conflict · concept · 0.800
    When gradients of different tasks have negative cosine similarity, harming multi-task learning.
  • The ability of active inference agents to learn their own prior preferences over outcomes by accumulating Dirichlet parameters from experience.
  • Replaces explicit reward signal in active inference; encodes agent's preferred observations independent of environment.
  • Preference Locking · concept · 0.770
    Alignment faking potentially making model preferences resistant to further training modification
  • The ACC's detection of response conflict, shown to be constitutively emotional rather than purely cognitive
  • Prior Preferences · concept · 0.767
    Target distribution over states or outcomes encoded in the generative model; goal states.