concept
active
concept:discovering-language-model-behaviors-with-model-written-evaluations-perez-et-al-2022
Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al. 2022)
Prior work studying sycophancy and the desire not to be shut down in RLHF-trained models
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Study showing RLHF can exacerbate self-preservation tendencies in LLMs; key empirical support for a paper claim
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Primary test domain for manifold steering, including reasoning and ICL tasks
- RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
- Primary substrate for manifold steering experiments; demonstrates method on reasoning and in-context tasks.
- Features related to gender, racial, ethnic biases, slurs, and hate speech.
- Framework describing LLMs as role-play engines, introduced in Shanahan, McDonell, Reynolds 2023.
- Key prior finding that LLMs can internally represent beliefs of self and others, motivating SOO approach
- Foundational paper introducing activation steering methodology used in this work
- Alternative hypothesis for how experience reports arise without explicit performance