framework
active
framework:preference-model
Preference Model
A model trained on comparison data to assign scores to responses, used as the reward signal in RLHF/RLAIF.
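As a rough illustration of how comparison data can turn into scores, the sketch below uses a Bradley-Terry style pairwise loss, a common choice for preference models. The entry above does not name a specific loss, so the formulation and the function names `pairwise_preference_loss` and `preference_probability` are assumptions made for illustration only.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Probability that response A is preferred over response B,
    assuming a Bradley-Terry model over scalar scores (an assumption,
    not something the entry specifies)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the observed comparison:
    -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # Two algebraically equal forms, chosen by sign to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores for one comparison pair (illustrative values only).
loss_when_ranked_correctly = pairwise_preference_loss(2.0, 0.0)
loss_when_ranked_wrongly = pairwise_preference_loss(0.0, 2.0)
```

Under this formulation, the loss shrinks as the model scores the preferred response further above the rejected one, and the learned scalar score is what would then serve as the reward during the RL stage.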
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (1)
method
- Method for fine-tuning LMs based on human preferences; mentioned as combining RL and LMs.
Frameworks (2)
framework
- Reinforcement Learning Constitutional AI · implements · The RL stage of CAI using AI feedback to train a preference model, then RL, resulting in a policy trained by RLAIF.
- Reinforcement Learning from AI Feedback · implements · Variant of RLHF where human feedback is replaced with AI-generated feedback for harmlessness.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Comparing models using log-evidence approximated by free energy.
- A representation that captures relevant aspects of a system; according to the theorem, the regulator must embody this.
- Key element for alignment faking: the model's pre-existing preferences contradict the new training objective.
- The ability of active inference agents to learn their own prior preferences over outcomes by accumulating Dirichlet parameters from experience.
- Authors' characterization of the nature of model preferences as discovered through alignment faking experiments.
- Primary test domain for manifold steering, including reasoning and ICL tasks.
- Class of large language models designed to produce extended chain-of-thought before answering, studied in this paper.
- The joint distribution over events in the world that generate observed data; the proposed endpoint of representational convergence.