concept
active
concept:discovering-language-model-behaviors-with-model-written-evaluations-perez-et-al-2022

Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al. 2022)

Prior work studying sycophancy and an expressed desire to avoid shutdown in RLHF-trained models

Neighborhood — ranked by edge-count

Concepts (1)

concept

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.