concept
active
concept:perez-et-al-2022-discovering-language-model-behaviors-with-model-written-evaluations
Perez et al. 2022: Discovering language model behaviors with model-written evaluations
Study showing that RLHF can exacerbate self-preservation tendencies in LLMs; cited as key empirical support for a claim in the citing paper
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Prior work studying sycophancy and desire not to be shut down in RLHF-trained models
Findings (1)
finding
- Empirical finding cited to support the claim that fine-tuning does not resolve the self-preservation role-play problem
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
- Features related to gender, racial, ethnic biases, slurs, and hate speech.
- Primary test domain for manifold steering, including reasoning and ICL tasks
- Key prior finding that LLMs can internally represent beliefs of self and others, motivating SOO approach
- Foundational paper introducing activation steering methodology used in this work
- Paper on reasoning and acting in LLMs; cited as example of extended dialogue agent capabilities
- Alternative hypothesis for how experience reports arise without explicit performance
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023) · concept · 0.776 · Safety intervention that relies on activation modification, which ESR might undermine