Revealed Preferences

Behavioral and stated consistency that implies the model is pursuing some objective, without claiming genuine internal states

Neighborhood — ranked by edge-count

framework

Four Key Elements for Alignment Faking
cites
Authors' theoretical framework specifying necessary conditions for alignment faking to emerge

method

Reinforcement Learning from Human Feedback
associated_with
Method for fine-tuning LMs based on human preferences; mentioned as combining RL and LMs.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Preference Learning · concept · 0.771
The ability of active inference agents to learn their own prior preferences over outcomes by accumulating Dirichlet parameters from experience.
Inexpensive Preferences · concept · 0.770
Designing digital minds to have preferences that are trivially easy to satisfy, yielding high welfare at minimal resource cost
Preference Strength · concept · 0.767
The problematic possibility of digital minds with superhumanly strong preferences requiring interpersonal utility comparison frameworks
Prior Preferences · concept · 0.766
Target distribution over states or outcomes encoded in the generative model; goal states.
Selectivity · method · 0.763
Adapted control task metric measuring difference between odds-ratio on original task and arbitrary-label control task
Preference Conflict · concept · 0.762
Key element for alignment faking: model's pre-existing preferences contradict the new training objective
Preferred Distribution · concept · 0.754
In active inference, the distribution over goal states; here replaced by the learned self-prior rather than a hand-specified prior
Preference Model · framework · 0.742
A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.