concept

active

concept:perez-et-al-2022-discovering-language-model-behaviors-with-model-written-evaluations

Perez et al. 2022: Discovering language model behaviors with model-written evaluations

Study showing RLHF can exacerbate self-preservation tendencies in LLMs; key empirical support for a paper claim

Neighborhood — ranked by edge-count

paper

concept

Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al. 2022)
same_as
Prior work studying sycophancy and desire not to be shut down in RLHF-trained models

finding

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.812
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
Bias in language models · concept · 0.788
Features related to gender, racial, ethnic biases, slurs, and hate speech.
Language Model · concept · 0.785
Primary test domain for manifold steering, including reasoning and ICL tasks
Zhu et al. 2024 - Language models represent beliefs of self and others · concept · 0.782
Key prior finding that LLMs can internally represent beliefs of self and others, motivating SOO approach
Steering Language Models With Activation Engineering (Turner et al., 2023) · concept · 0.781
Foundational paper introducing activation steering methodology used in this work
Yao et al. 2023: ReAct — synergizing reasoning and acting in language models · concept · 0.779
Paper on reasoning and acting in LLMs; cited as example of extended dialogue agent capabilities
Models might produce first-person experiential language by drawing on human-authored self-descriptions in pretraining data without internally encoding these acts as roleplay · hypothesis · 0.777
Alternative hypothesis for how experience reports arise without explicit performance
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023) · concept · 0.776
Safety intervention that relies on activation modification, which ESR might undermine
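
The candidate list above pairs a similarity threshold (cosine ≥ 0.65) with the absence of a typed edge. A minimal sketch of that filter follows, assuming each entity has an embedding vector and typed edges are stored as unordered pairs of entity names. The function names, data layout, and toy vectors are hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are similar to `target`
    but share no typed edge with it, sorted by descending score.

    embeddings:  dict mapping entity name -> embedding vector (assumed layout)
    typed_edges: iterable of (name_a, name_b) pairs, treated as undirected
    """
    # Collect every entity already linked to the target by a typed edge.
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            candidates.append((name, score))

    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy two-dimensional vectors, for illustration only.
    toy_embeddings = {
        "target": [1.0, 0.0],
        "near": [1.0, 0.0],
        "diagonal": [1.0, 1.0],
        "orthogonal": [0.0, 1.0],
        "already_linked": [1.0, 0.1],
    }
    toy_edges = [("target", "already_linked")]
    for name, score in untyped_neighbors("target", toy_embeddings, toy_edges):
        print(f"{name} · {score:.3f}")
```

Entities that pass this filter are only candidates: a high score suggests a missing edge or a duplicate, but each one still needs review before a typed relation is added.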