Preference Model

A model trained on comparison data to assign scalar scores to responses; the scores serve as the reward signal in RLHF/RLAIF.
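As a rough illustration of what "trained on comparison data" can mean, the sketch below computes a pairwise loss for one comparison. It assumes a Bradley-Terry style objective, which is a common choice but is not stated in this entry; the function name `pairwise_preference_loss` and its parameters are hypothetical.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Loss for one comparison, assuming a Bradley-Terry objective.

    Computes -log(sigmoid(score_chosen - score_rejected)), i.e. the negative
    log-likelihood that the preferred response outranks the other one.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow
    # for large positive or negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical scores: the loss shrinks as the preferred response
# is scored further above the rejected one.
loss_when_correct = pairwise_preference_loss(2.0, 0.0)
loss_when_wrong = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this quantity over many comparisons pushes the model to score preferred responses higher, which is what lets the resulting score act as a reward signal.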

Neighborhood — ranked by edge-count

paper

method

Reinforcement Learning from Human Feedback
implements
Method for fine-tuning LMs based on human preferences; mentioned as combining RL and LMs.

framework

Reinforcement Learning Constitutional AI
implements
The RL stage of CAI using AI feedback to train a preference model, then RL, resulting in a policy trained by RLAIF.
Reinforcement Learning from AI Feedback
implements
Variant of RLHF where human feedback is replaced with AI-generated feedback for harmlessness.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

model selection · concept · 0.824
Comparing models using log-evidence approximated by free energy.
model · concept · 0.806
A representation that captures relevant aspects of a system; according to the theorem, the regulator must embody this.
Preference Conflict · concept · 0.802
Key element for alignment faking: model's pre-existing preferences contradict the new training objective
Preference Learning · concept · 0.787
The ability of active inference agents to learn their own prior preferences over outcomes by accumulating Dirichlet parameters from experience.
Model preferences are not consistent across contexts but tend to be relatively consistent within a single context · claim · 0.785
Authors' characterization of the nature of model preferences as discovered through alignment faking experiments
Language Model · concept · 0.773
Primary test domain for manifold steering, including reasoning and ICL tasks
Reasoning Models · concept · 0.773
Class of large language models designed to produce extended chain-of-thought before answering, studied in this paper
World Model (statistical) · concept · 0.768
The joint distribution over events in the world that generate observed data; the proposed endpoint of representational convergence
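
The candidate list above is defined by two conditions: cosine similarity of at least 0.65 and no typed edge to this entity. A minimal sketch of that filter follows, assuming each entity has an embedding vector; the function names, the toy embeddings, and the data layout are all hypothetical and do not describe how this page was actually generated.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def untyped_neighbors(target, embeddings, typed_neighbors, threshold=0.65):
    """Entities at or above the threshold that have no typed edge to target.

    Returns (name, similarity) pairs sorted by similarity, highest first.
    """
    candidates = []
    for name, vector in embeddings.items():
        if name == target or name in typed_neighbors:
            continue
        similarity = cosine_similarity(embeddings[target], vector)
        if similarity >= threshold:
            candidates.append((name, similarity))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy two-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "Preference Model": [1.0, 0.0],
    "Preference Learning": [1.0, 0.1],
    "Unrelated Entity": [0.0, 1.0],
    "Reinforcement Learning from AI Feedback": [1.0, 1.0],
}
toy_typed = {"Reinforcement Learning from AI Feedback"}
toy_result = untyped_neighbors("Preference Model", toy_embeddings, toy_typed)
```

In the toy data, the entity with a typed edge is excluded even though it clears the threshold, which mirrors the split between the typed-edge list and the similarity-only list on this page.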