concept
active
concept:revealed-preferences
Revealed Preferences
Consistency in a model's behavior and statements that implies the model is pursuing some objective, without claiming that it has genuine internal states
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Authors' theoretical framework specifying necessary conditions for alignment faking to emerge
Methods (1)
method
- Reinforcement Learning from Human Feedback · associated_with · Method for fine-tuning LMs based on human preferences; mentioned as combining RL and LMs.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The ability of active inference agents to learn their own prior preferences over outcomes by accumulating Dirichlet parameters from experience.
- Designing digital minds to have preferences that are trivially easy to satisfy, yielding high welfare at minimal resource cost
- The problematic possibility of digital minds with superhumanly strong preferences requiring interpersonal utility comparison frameworks
- Target distribution over states or outcomes encoded in the generative model; goal states.
- Adapted control task metric measuring difference between odds-ratio on original task and arbitrary-label control task
- Key element for alignment faking: model's pre-existing preferences contradict the new training objective
- In active inference, the distribution over goal states; here replaced by the learned self-prior rather than a hand-specified prior
- A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
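The "related by similarity" filter described above (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is an illustrative sketch, not the page's actual implementation: the function names, the entity identifiers, the embedding dictionary, and the edge representation are all hypothetical, and the real system may compute or store similarity differently.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to `target` but unlinked.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical layout)
    typed_edges: iterable of (source_id, target_id) pairs, treated as undirected
    threshold:   minimum cosine similarity, 0.65 as stated on the page
    """
    # Collect every entity that already shares a typed edge with the target.
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target or entity_id in linked:
            continue  # skip the entity itself and anything with a typed relation
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((entity_id, score))

    # Highest similarity first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are the ones worth reviewing by hand, since a high score with no typed edge suggests either a missing relation or a duplicate entry.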