concept
active
concept:perez-et-al-2022-discovering-language-model-behaviors-with-model-written-evaluations

Perez et al. 2022: Discovering language model behaviors with model-written evaluations

Study reporting that RLHF can exacerbate self-preservation tendencies in LLMs, such as a stated desire to avoid being shut down; cited as key empirical support for a paper claim

Neighborhood — ranked by edge-count

Concepts (1)

concept

Findings (1)

finding

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.