Zhu et al. 2024 - Language models represent beliefs of self and others

Key prior finding that LLMs can internally represent the beliefs of self and others, motivating the SOO approach.

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Given a language model M and a statement s, does M believe s to be true? · question · 0.821
The core motivating question of the paper, framed by Christiano et al. (2021)
Language models are few-shot learners (Brown et al., 2020) · concept · 0.802
Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.
Language Model · concept · 0.795
Primary test domain for manifold steering, including reasoning and ICL tasks
Bias in language models · concept · 0.793
Features related to gender, racial, and ethnic biases, as well as slurs and hate speech.
Modern language models possess at least a limited, functional form of introspective awareness · claim · 0.792
The paper's central interpretive assertion.
Language Models · concept · 0.790
Primary substrate for manifold steering experiments; demonstrates method on reasoning and in-context tasks.
Our results demonstrate that modern language models possess at least a limited, functional form of introspective awareness. · quote · 0.788
Abstract's main conclusion.
Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.783
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
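
The neighborhood criterion stated above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as pairs of entity ids; the function names, the data layout, and the ordering by descending score are hypothetical and are not taken from the tool that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target
    but have no typed edge to it, sorted by descending score.

    embeddings:  dict mapping entity id -> list of floats (assumed layout)
    typed_edges: iterable of (id_a, id_b) pairs, treated as undirected (assumption)
    """
    target_vec = embeddings[target_id]

    # Collect every entity already linked to the target by a typed edge.
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)

    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            hits.append((entity_id, score))

    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits
```

Entries returned by such a filter would then be reviewed by hand, either to add a typed edge or to merge near-duplicates (for instance, two entities whose titles differ only in singular versus plural form).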