method
active
method:reinforcement-learning-from-human-feedback

Reinforcement Learning from Human Feedback

Method for fine-tuning language models (LMs) on human preference data; mentioned as combining reinforcement learning (RL) with LMs.

Neighborhood — ranked by edge-count

Frameworks (3)

framework
  • Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
  • Variant of RLHF where human feedback is replaced with AI-generated feedback for harmlessness.
  • A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
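
The last entry above describes a model trained on comparison data to score responses. This card does not say which training objective is used; a common choice, assumed here, is the Bradley-Terry pairwise loss, which penalizes the model when the preferred response does not outscore the rejected one. The function names below are illustrative only.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one comparison.

    Equals -log(sigmoid(score_chosen - score_rejected)), computed in a
    numerically stable way. ASSUMPTION: the card does not name this loss.
    """
    margin = score_chosen - score_rejected
    if margin >= 0:
        # log(1 + exp(-margin)) is safe when margin is non-negative
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite to avoid overflow in exp(-margin)
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(score_pairs: list[tuple[float, float]]) -> float:
    """Average loss over (chosen_score, rejected_score) pairs."""
    return sum(pairwise_preference_loss(c, r) for c, r in score_pairs) / len(score_pairs)
```

The loss is small when the chosen response scores well above the rejected one and grows roughly linearly when the ordering is reversed, so minimizing it pushes the scores toward agreement with the comparison labels.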

Concepts (2)

concept
  • Alignment Faking
    associated_with
    Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences
  • Revealed Preferences
    associated_with
    Behavioral and stated consistency that implies the model is pursuing some objective, without claiming genuine internal states

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
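
A minimal sketch of how such a list could be produced, assuming each entity has an embedding vector and typed edges are stored as ID pairs. Only the 0.65 cosine threshold comes from this card; the data layout, the function names, and the treatment of edges as undirected are assumptions.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, similarity) pairs with cosine >= threshold and no
    typed edge to target_id, most similar first.

    embeddings:  dict mapping entity_id -> vector (hypothetical layout)
    typed_edges: iterable of (entity_id, entity_id) pairs, treated as undirected
    """
    target = embeddings[target_id]
    # Collect every entity already connected to the target by a typed edge
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)
    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        sim = cosine(target, vec)
        if sim >= threshold:
            results.append((entity_id, sim))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by a filter like this are the ones a curator would review as possible new edges or duplicates.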