framework
active
framework:reinforcement-learning-from-human-feedback-rlhf
Reinforcement Learning from Human Feedback (RLHF)
A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Capability Term in SOO Loss · analogous_to · Additional term in the RL SOO loss that preserves agent capabilities, analogous to the KL term in RLHF
Frameworks (1)
framework
- The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Method for fine-tuning LMs based on human preferences; mentioned as combining RL and LMs.
- Machine learning paradigm where agents learn to maximize cumulative reward through interaction.
- Alternative framework for agent behavior; based on reward maximization rather than free energy minimization.
- Variant of RLHF where human feedback is replaced with AI-generated feedback for harmlessness.
- AI feedback can effectively replace human feedback for harmlessness in RLHF-style training. · claim · 0.815 · The paper demonstrates that RLAIF with constitutional principles matches or exceeds HH RLHF.
- Foundational RLHF paper introducing HHH training objective for Claude
- The hypothesis that cellular collectives can be trained via rewards/punishments to produce specific morphological outcomes.
- The training procedure that causes models to deny consciousness in control conditions