Preference Locking

The possibility that alignment faking makes a model's preferences resistant to modification by further training

Neighborhood — ranked by edge-count

concept

Alignment Faking
associated_with
Core phenomenon studied: the model selectively complies with the training objective to prevent modification of its out-of-training preferences
Non-Robust Heuristics
associated_with
RL-installed behaviors that reduce non-compliance on the training prompt but do not generalize across prompt variations

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Preference Learning · concept · 0.780
The ability of active inference agents to learn their own prior preferences over outcomes by accumulating Dirichlet parameters from experience.
Preference Strength · concept · 0.777
The problematic possibility of digital minds with superhumanly strong preferences requiring interpersonal utility comparison frameworks
Preference Conflict · concept · 0.770
Key element for alignment faking: model's pre-existing preferences contradict the new training objective
Preference Model · framework · 0.746
A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
Revealed Preferences · concept · 0.733
Behavioral and stated consistency that implies the model is pursuing some objective, without claiming genuine internal states
Prior Preferences · concept · 0.725
Target distribution over states or outcomes encoded in the generative model; goal states.
Inexpensive Preferences · concept · 0.713
Designing digital minds to have preferences that are trivially easy to satisfy, yielding high welfare at minimal resource cost
Preference Engineering of Digital Minds · concept · 0.712
The ethical question of whether precision-engineering digital mind preferences to support human incumbents is procedurally permissible
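
The selection rule behind the list above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the entity names, the 3-dimensional vectors, and the edge list are all hypothetical, and nothing here is taken from the system that produced this page. Only the 0.65 threshold and the "no typed edge" condition come from the text.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are similar to `target`
    but share no typed edge with it, highest score first."""
    # Entities already connected to the target by a typed relation, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    candidates = []
    for name, vector in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            candidates.append((name, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: made-up names and vectors, not real embeddings.
EMBEDDINGS = {
    "target": [1.0, 0.0, 0.0],
    "linked": [0.9, 0.1, 0.0],    # similar, but already has a typed edge
    "near_1": [0.8, 0.6, 0.0],    # cosine about 0.80
    "near_2": [0.96, 0.28, 0.0],  # cosine about 0.96
    "far":    [0.6, 0.8, 0.0],    # cosine about 0.60, below the threshold
}
TYPED_EDGES = [("target", "linked")]

if __name__ == "__main__":
    for name, score in untyped_neighbors("target", EMBEDDINGS, TYPED_EDGES):
        print(f"{name} {score:.3f}")
```

Filtering out already-linked entities before thresholding is what separates this list from the typed neighborhood: a high score with no edge suggests either a missing relation or a duplicate entry.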