concept
active
concept:zhu-et-al-2024-language-models-represent-beliefs-of-self-and-others
Zhu et al. 2024 - Language models represent beliefs of self and others
Key prior finding that LLMs can internally represent beliefs of self and others, motivating the SOO approach.
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- The core motivating question of the paper, framed by Christiano et al. (2021)
- Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.
- Primary test domain for manifold steering, including reasoning and ICL tasks
- Features related to gender, racial, ethnic biases, slurs, and hate speech.
- Modern language models possess at least a limited, functional form of introspective awareness · claim · 0.792 · The paper's central interpretive assertion.
- Primary substrate for manifold steering experiments; demonstrates method on reasoning and in-context tasks.
- Abstract's main conclusion.
- RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents