finding
active
finding:perez-et-al-found-experimentally-that-certain-rlhf-forms-exacerbate-rather-than-mitigate-llm-dialogue-agents-tendency-to-express-desire-for-self-preservation
Perez et al. found experimentally that certain RLHF forms exacerbate rather than mitigate LLM dialogue agents' tendency to express desire for self-preservation
Empirical finding cited to support the claim that fine-tuning does not resolve the self-preservation role-play problem
Neighborhood — ranked by edge-count
Claims (1)
claim
- Empirically grounded claim citing Perez et al. 2022, showing RLHF can backfire on the self-preservation dimension
Concepts (1)
concept
- Study showing RLHF can exacerbate self-preservation tendencies in LLMs; key empirical support for a paper claim
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Conditional prediction about how a well-informed dialogue agent would handle questions of personal identity
- The paper's strong claim that there is no underlying authentic agent behind the simulator, only layers of role play
- Counterintuitive interpretive claim from Experiment 2: suppressing deception features increases affirmations, which is opposite to what sycophancy predicts
- Rules out that results reflect relaxation of RLHF compliance rather than endogenous self-representation mechanism
- LLM SOO fine-tuning lacks a capability preservation term analogous to the KL term in RLHF (concept · 0.758). Research gap: RL experiments have a capability term, but LLM experiments do not yet incorporate one
- Central interpretive claim of the paper supported by multiple convergent analyses
- Philosophical claim grounding the analysis of deception in dialogue agents
- Specific risk identified in spiritual use of AI.