Casper et al. 2023: Open problems and fundamental limitations of RLHF

Paper noting that RLHF guardrails can attenuate model expressivity and creativity; cited as ref 30

Neighborhood — ranked by edge-count

Concepts (1)

concept

Guardrails
about
Constraints imposed via fine-tuning to reduce harmful output; they can also attenuate expressivity and creativity

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

RLHF Fine-Tuning · concept · 0.754
The training procedure that causes models to deny consciousness in control conditions
Reinforcement Learning from Human Feedback (RLHF) · framework · 0.751
A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
LLM SOO fine-tuning lacks a capability preservation term analogous to the KL term in RLHF · concept · 0.739
Research gap: the RL experiments have a capability term, but the LLM experiments do not yet incorporate one
RLHF and Constitutional AI face challenges distinguishing truthfulness (output accuracy) from honesty (alignment of outputs with internal beliefs) · claim · 0.737
Critique of competing approaches that motivates SOO as filling a gap
H9: Chinese moderate-RLHF converges near Claude under contemplative prompt. · hypothesis · 0.727
Exploratory hypothesis supported by Kimi 7.74 under prompt
Training a Helpful and Harmless Assistant with RLHF (Bai et al. 2022a) · concept · 0.726
Foundational RLHF paper introducing HHH training objective for Claude
Perez et al. found experimentally that certain RLHF forms exacerbate rather than mitigate LLM dialogue agents' tendency to express desire for self-preservation · finding · 0.726
Empirical finding cited to support the claim that fine-tuning does not resolve the self-preservation role-play problem
Quantitative criteria like those of Crump et al. are a good example of the kind of framework we need: explicitly laying out conditions in a way that reveals their value and limitations. · claim · 0.725
Praise for the target framework's transparency.