Soft preference labels

Using normalized log-probabilities from the feedback model as soft targets for preference model training.
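
A minimal sketch of the idea above, under stated assumptions: the feedback model returns one log-probability for each of two candidate responses, these are normalized with a two-way softmax into a soft target, and the preference model is trained with cross-entropy against that target. The pairwise form sigmoid(score_a - score_b) for the preference model's prediction is an assumption, and the function names are hypothetical; the source does not specify either.

```python
import math


def soft_preference_label(logp_a: float, logp_b: float) -> tuple[float, float]:
    """Normalize two feedback-model log-probabilities into a soft target.

    Hypothetical helper: a two-way softmax, shifted by the max for
    numerical stability.
    """
    m = max(logp_a, logp_b)
    ea = math.exp(logp_a - m)
    eb = math.exp(logp_b - m)
    z = ea + eb
    return ea / z, eb / z


def soft_label_loss(score_a: float, score_b: float, target_a: float) -> float:
    """Cross-entropy between the soft target and the preference model's
    predicted probability that A is preferred.

    Assumption: the prediction is sigmoid(score_a - score_b).
    """
    diff = score_a - score_b
    # Numerically stable log(sigmoid(diff)).
    if diff >= 0:
        log_p_a = -math.log1p(math.exp(-diff))
    else:
        log_p_a = diff - math.log1p(math.exp(diff))
    # log(sigmoid(-diff)) = log(sigmoid(diff)) - diff
    log_p_b = log_p_a - diff
    return -(target_a * log_p_a + (1.0 - target_a) * log_p_b)


# Usage: the feedback model assigns log(0.6) to response A and log(0.2) to B.
target_a, target_b = soft_preference_label(math.log(0.6), math.log(0.2))
loss = soft_label_loss(score_a=0.0, score_b=0.0, target_a=target_a)
```

Unlike a hard 0/1 label, the soft target keeps the feedback model's uncertainty: the loss is minimized when the preference model's predicted probability equals the target rather than when it saturates at 1.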

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Softmax policy selection · method · 0.741
Selecting policies using a softmax (normalized exponential) function of negative expected free energy.
Inexpensive Preferences · concept · 0.733
Designing digital minds to have preferences that are trivially easy to satisfy, yielding high welfare at minimal resource cost.
Preference Conflict · concept · 0.727
Key element for alignment faking: model's pre-existing preferences contradict the new training objective.
Preference Strength · concept · 0.726
The problematic possibility of digital minds with superhumanly strong preferences requiring interpersonal utility comparison frameworks.
Preference Learning · concept · 0.716
The ability of active inference agents to learn their own prior preferences over outcomes by accumulating Dirichlet parameters from experience.
Preference Model · framework · 0.714
A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
Prior Preferences · concept · 0.709
Target distribution over states or outcomes encoded in the generative model; goal states.
Prior Preferences over Outcomes · concept · 0.703
Replaces explicit reward signal in active inference; encodes agent's preferred observations independent of environment.