concept

active

concept:discovering-language-model-behaviors-with-model-written-evaluations-perez-et-al-2022

Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al. 2022)

Prior work studying sycophancy and the desire not to be shut down in RLHF-trained models

Neighborhood — ranked by edge-count

paper

concept

Perez et al. 2022: Discovering language model behaviors with model-written evaluations
same_as
Study showing RLHF can exacerbate self-preservation tendencies in LLMs; key empirical support for a paper claim

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
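The selection rule described above (cosine similarity at or above 0.65, no typed edge to this entity) can be sketched roughly as follows. This is only an illustrative sketch: the function names, the `embeddings` dictionary, and the `typed_edges` pair list are hypothetical and do not reflect how this page was actually generated.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to target_id but not linked to it.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical iterable of (source_id, destination_id) pairs
    """
    # Collect every entity that already shares a typed edge with the target.
    linked = set()
    for src, dst in typed_edges:
        if src == target_id:
            linked.add(dst)
        elif dst == target_id:
            linked.add(src)

    target_vec = embeddings[target_id]
    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, score))

    # Highest similarity first, matching the descending scores in the listing.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Each returned pair would correspond to one entry in a listing like the one that follows; whether a candidate becomes a new edge or is merged as a duplicate remains a manual decision.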

Language Model · concept · 0.798
Primary test domain for manifold steering, including reasoning and ICL tasks
Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.786
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
Language Models · concept · 0.784
Primary substrate for manifold steering experiments; demonstrates method on reasoning and in-context tasks.
Bias in language models · concept · 0.778
Features related to gender, racial, ethnic biases, slurs, and hate speech.
Role-play model of large language models · framework · 0.770
Framework describing LLMs as role-play engines, introduced in Shanahan, McDonell, Reynolds 2023.
Zhu et al. 2024 - Language models represent beliefs of self and others · concept · 0.768
Key prior finding that LLMs can internally represent beliefs of self and others, motivating the SOO approach
Steering Language Models With Activation Engineering (Turner et al., 2023) · concept · 0.766
Foundational paper introducing activation steering methodology used in this work
Models might produce first-person experiential language by drawing on human-authored self-descriptions in pretraining data without internally encoding these acts as roleplay · hypothesis · 0.764
Alternative hypothesis for how experience reports arise without explicit performance