concept
active
concept:preference-learning

Preference Learning

The ability of active inference agents to learn their own prior preferences over outcomes by accumulating Dirichlet parameters from experience.
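The Dirichlet accumulation described above can be illustrated with a minimal sketch. It assumes a single categorical outcome modality, flat initial counts of one, and a fixed learning rate `rate`. All function names are hypothetical. It also uses the Dirichlet mean (normalized counts) as the learned preference distribution, which is a simplification: fuller treatments use the expected log-probability under the Dirichlet.

```python
import math
from typing import List


def update_preference_counts(counts: List[float], outcome: int, rate: float = 1.0) -> List[float]:
    """Return a copy of the Dirichlet counts with `rate` added to the observed outcome."""
    new_counts = list(counts)
    new_counts[outcome] += rate
    return new_counts


def preference_distribution(counts: List[float]) -> List[float]:
    """Dirichlet mean (normalized counts), used here as the learned preference distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def log_preferences(counts: List[float]) -> List[float]:
    """Log of the preference distribution, i.e. log-preferences over outcomes."""
    return [math.log(p) for p in preference_distribution(counts)]


# Hypothetical example: 3 outcomes, flat initial counts, four observed outcomes.
counts = [1.0, 1.0, 1.0]
for observed in [0, 0, 2, 0]:
    counts = update_preference_counts(counts, observed)

dist = preference_distribution(counts)
log_prefs = log_preferences(counts)
print(counts)  # [4.0, 1.0, 2.0]
```

Outcomes the agent encounters more often accumulate larger counts and so become more strongly preferred. A smaller `rate` makes these learned preferences change more slowly.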

Neighborhood — ranked by edge-count

Concepts (1)

concept
  • Prior Preferences
    associated_with
    Target distribution over states or outcomes encoded in the generative model; goal states.
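In many active inference formulations, a target distribution over outcomes enters planning through the "risk" term of expected free energy. This term is the KL divergence between the outcomes a course of action is predicted to produce and the preferred outcomes. The sketch below assumes a single categorical outcome modality and strictly positive preferred probabilities. The distributions are invented for illustration.

```python
import math
from typing import Sequence


def kl_divergence(q: Sequence[float], p: Sequence[float]) -> float:
    """KL[q || p] for categorical distributions; terms with q_i == 0 contribute nothing."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Hypothetical target (preferred) distribution over three outcomes.
preferred = [0.7, 0.2, 0.1]

# Hypothetical outcome predictions under two candidate courses of action.
predicted_a = [0.6, 0.3, 0.1]
predicted_b = [0.1, 0.2, 0.7]

risk_a = kl_divergence(predicted_a, preferred)
risk_b = kl_divergence(predicted_b, preferred)

# The prediction closer to the preferred outcomes has lower risk.
print(risk_a < risk_b)  # True
```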

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
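The listing rule stated above (cosine similarity of at least 0.65 and no typed edge) can be sketched as a simple filter. This assumes that entity embeddings are plain numeric vectors and that typed edges are stored as ordered ID pairs. It also assumes that an edge in either direction counts as a typed relation. The function names, IDs, and toy vectors are hypothetical and are not taken from the tool's actual implementation.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similar_without_edge(
    target_id: str,
    embeddings: Dict[str, List[float]],
    typed_edges: Set[Tuple[str, str]],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities with cosine >= threshold and no typed edge to the target, best first."""
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Assumption: an edge in either direction counts as a typed relation.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits


# Hypothetical toy embeddings and one typed edge.
embeddings = {
    "preference-learning": [1.0, 0.0],
    "learning": [0.8, 0.6],
    "prior-preferences": [1.0, 0.1],
    "unrelated": [0.0, 1.0],
}
typed_edges = {("preference-learning", "prior-preferences")}

hits = similar_without_edge("preference-learning", embeddings, typed_edges)
print([entity for entity, _ in hits])  # ['learning']
```

Here `prior-preferences` is highly similar but is excluded because it already has a typed edge to the target.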

  • Key element for alignment faking: the model's pre-existing preferences contradict the new training objective.
  • The problematic possibility of digital minds with superhumanly strong preferences, which would require frameworks for interpersonal utility comparison.
  • Learning · concept · 0.788
    Inference of parameters encoding contingencies of the world (e.g., likelihood matrix A) at slower timescale than perception.
  • Preference Model · framework · 0.787
    A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
  • Sentience criterion; capacity occurs even in gene regulatory networks and non-neural morphogenetic agents.
  • Preference Locking · concept · 0.780
    Alignment faking that potentially makes model preferences resistant to further training modification.
  • Value Learning · concept · 0.777
    Field of research integrating reward learning and optimization; shown to be unified with perceptual learning via the free energy principle.
  • Behavioral and stated consistency that implies the model is pursuing some objective, without claiming genuine internal states