concept
active
concept:casper-et-al-2023-open-problems-and-fundamental-limitations-of-rlhf
Casper et al. 2023: Open problems and fundamental limitations of RLHF
Paper noting that RLHF guardrails can attenuate model expressivity and creativity; cited as ref 30
Neighborhood — ranked by edge-count
Concepts (1)
- Guardrails · about · Constraints imposed via fine-tuning to reduce harmful output; can reduce harm but also attenuate expressivity and creativity
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The training procedure that causes models to deny consciousness in control conditions
- A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
- LLM SOO fine-tuning lacks a capability preservation term analogous to the KL term in RLHF · concept · 0.739 · Research gap: RL experiments have a capability term, but LLM experiments do not yet incorporate one
- Critique of competing approaches that motivates SOO as filling a gap
- Exploratory hypothesis supported by Kimi 7.74 under prompt
- Foundational RLHF paper introducing HHH training objective for Claude
- Empirical finding cited to support the claim that fine-tuning does not resolve the self-preservation role-play problem
- Praise for the target framework's transparency.