method
active
method:soft-preference-labels

Soft preference labels

Using normalized log-probabilities from the feedback model as soft targets for preference model training.
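
A minimal sketch of the idea, under stated assumptions: the feedback model returns one log-probability per response option in a two-way comparison, and the preference model is trained with a cross-entropy loss on the sigmoid of the score difference (a Bradley-Terry-style pairing that the entry itself does not specify). The function names `soft_label`, `softplus`, and `soft_preference_loss` are hypothetical and chosen only for illustration.

```python
import math


def soft_label(logp_a: float, logp_b: float) -> float:
    """Normalize two feedback-model log-probabilities into P(A is preferred).

    Subtracting the max before exponentiating keeps the computation stable
    when the log-probabilities are large and negative.
    """
    m = max(logp_a, logp_b)
    ea = math.exp(logp_a - m)
    eb = math.exp(logp_b - m)
    return ea / (ea + eb)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def soft_preference_loss(score_a: float, score_b: float, target_a: float) -> float:
    """Cross-entropy between the soft target and sigmoid(score_a - score_b).

    ASSUMPTION: the preference model emits one scalar score per response and
    the predicted preference is the sigmoid of the score difference.
    """
    diff = score_a - score_b
    log_p_a = -softplus(-diff)  # log sigmoid(diff)
    log_p_b = -softplus(diff)   # log (1 - sigmoid(diff))
    return -(target_a * log_p_a + (1.0 - target_a) * log_p_b)


# Example: the feedback model assigns log(0.6) to option A and log(0.2) to B.
target = soft_label(math.log(0.6), math.log(0.2))
loss = soft_preference_loss(score_a=1.0, score_b=0.0, target_a=target)
```

With a soft target the loss is minimized when the predicted preference equals the target rather than when it saturates at 0 or 1, so the preference model is not pushed toward overconfident scores on comparisons where the feedback model itself was uncertain.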

Neighborhood — ranked by edge-count

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Selecting policies using a softmax (normalized exponential) function of negative expected free energy.
  • Designing digital minds to have preferences that are trivially easy to satisfy, yielding high welfare at minimal resource cost
  • Key element for alignment faking: model's pre-existing preferences contradict the new training objective
  • The problematic possibility of digital minds with superhumanly strong preferences requiring interpersonal utility comparison frameworks
  • The ability of active inference agents to learn their own prior preferences over outcomes by accumulating Dirichlet parameters from experience.
  • Preference Model · framework · 0.714
    A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
  • Prior Preferences · concept · 0.709
    Target distribution over states or outcomes encoded in the generative model; goal states.
  • Replaces explicit reward signal in active inference; encodes agent's preferred observations independent of environment.