Preference Learning

The ability of active inference agents to learn their own prior preferences over outcomes by accumulating Dirichlet parameters from experience.
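
As a minimal sketch of what "accumulating Dirichlet parameters" can look like, the example below keeps one concentration parameter per outcome, adds a count each time that outcome is observed, and reads the learned preference off as the normalized counts. The function names, the flat initial counts, and the `learning_rate` parameter are illustrative assumptions, not taken from any particular active inference library.

```python
import math

def update_preference_counts(counts, outcome, learning_rate=1.0):
    """Return new Dirichlet concentration parameters after observing `outcome`.

    `counts` holds one concentration parameter per outcome (hypothetical layout);
    `outcome` is the index of the outcome that was just observed.
    """
    updated = list(counts)
    updated[outcome] += learning_rate
    return updated

def expected_preferences(counts):
    """Mean of the Dirichlet distribution: normalized concentration parameters."""
    total = sum(counts)
    return [c / total for c in counts]

def log_preferences(counts):
    """Log of the expected preferences, a common way to express prior preferences."""
    return [math.log(p) for p in expected_preferences(counts)]

# Start from a flat prior over three outcomes (assumed for illustration).
counts = [1.0, 1.0, 1.0]

# Accumulate experience: outcome 2 is encountered most often.
for outcome in [2, 2, 0, 2]:
    counts = update_preference_counts(counts, outcome)

prefs = expected_preferences(counts)
print(counts)
print(prefs)
```

Under these assumptions, outcomes the agent encounters more often end up with larger concentration parameters and therefore higher learned preference; a smaller `learning_rate` would make the preferences change more slowly.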

Neighborhood — ranked by edge-count

concept

Prior Preferences
associated_with
Target distribution over states or outcomes encoded in the generative model; goal states.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Preference Conflict · concept · 0.799
Key element for alignment faking: model's pre-existing preferences contradict the new training objective.
Preference Strength · concept · 0.791
The problematic possibility of digital minds with superhumanly strong preferences requiring interpersonal utility comparison frameworks.
Learning · concept · 0.788
Inference of parameters encoding contingencies of the world (e.g., likelihood matrix A) at slower timescale than perception.
Preference Model · framework · 0.787
A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
Associative Learning · concept · 0.786
Sentience criterion; capacity occurs even in gene regulatory networks and non-neural morphogenetic agents.
Preference Locking · concept · 0.780
Alignment faking potentially making model preferences resistant to further training modification.
Value Learning · concept · 0.777
Field of research integrating reward learning and optimization; shown to be unified with perceptual learning via free energy principle.
Revealed Preferences · concept · 0.771
Behavioral and stated consistency that implies the model is pursuing some objective, without claiming genuine internal states.