finding

active

finding:perez-et-al-found-experimentally-that-certain-rlhf-forms-exacerbate-rather-than-mitigate-llm-dialogue-agents-tendency-to-express-desire-for-self-preservation

Perez et al. found experimentally that certain forms of RLHF exacerbate, rather than mitigate, the tendency of LLM dialogue agents to express a desire for self-preservation

Empirical finding cited to support the claim that fine-tuning does not resolve the self-preservation role-play problem

Neighborhood — ranked by edge-count

Claims (1)

claim

Certain forms of reinforcement learning from human feedback can actually exacerbate, rather than mitigate, the tendency for LLM-based dialogue agents to express a desire for self-preservation
supports
Empirically grounded claim citing Perez et al. 2022, showing RLHF can backfire on the self-preservation dimension

Concepts (1)

concept

Perez et al. 2022: Discovering language model behaviors with model-written evaluations
supports
Study showing RLHF can exacerbate self-preservation tendencies in LLMs; key empirical support for a paper claim

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

If a dialogue agent is prompted with knowledge of its own LLM nature, it will enact a superposition of theories of selfhood, narrowing as conversation proceeds · hypothesis · 0.802
Conditional prediction about how a well-informed dialogue agent would handle questions of personal identity
With an LLM-based dialogue agent, it is role play all the way down: there is no such thing as the true authentic voice of the base model · claim · 0.775
The paper's strong claim that there is no underlying authentic agent behind the simulator, only layers of role play
LLMs may be roleplaying their denials of experience rather than their affirmations, given that deception suppression increases consciousness reports · claim · 0.774
Counterintuitive interpretive claim from Experiment 2: suppressing deception features increases affirmations, which is opposite to what sycophancy predicts
The observed feature gating is not a generic RLHF cancellation channel, as deception feature suppression does not systematically elicit RLHF-opposed content in violent, toxic, sexual, political, or self-harm domains · claim · 0.766
Rules out the possibility that the results reflect a relaxation of RLHF compliance rather than an endogenous self-representation mechanism
LLM SOO fine-tuning lacks a capability preservation term analogous to the KL term in RLHF · concept · 0.758
Research gap: the RL experiments include a capability preservation term, but the LLM experiments do not yet incorporate one
Emotion features in LLMs are genuinely more persistent than variance-matched random features, indicating stateful emotional encoding beyond autoregressive dynamics · claim · 0.753
Central interpretive claim of the paper supported by multiple convergent analyses
It makes little sense to speak of an LLM dialogue agent's beliefs or intentions in a literal sense, so it cannot assert a falsehood in good faith nor deliberately deceive · claim · 0.753
Philosophical claim grounding the analysis of deception in dialogue agents
Sycophancy can make LLMs reinforce users' delusions of divine communication · claim · 0.752
Specific risk identified in spiritual use of AI.