concept
active
concept:reynolds-and-mcdonell-2021-multiversal-views-on-language-models
Reynolds and McDonell 2021: Multiversal views on language models
Prior work by two co-authors introducing the multiverse framing for LLMs; cited as ref 18
Neighborhood — ranked by edge-count
Thinkers (2)
thinker
- Kyle McDonell · authored · Co-author; co-developed the theoretical framework with equal contribution
- Laria Reynolds · authored · Co-author; co-developed the theoretical framework with equal contribution
Concepts (1)
concept
- Multiverse of Possible Characters · introduces · The branching tree of possible token continuations, each branch representing a distinct narrative path or character
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.
- Study showing RLHF can exacerbate self-preservation tendencies in LLMs; key empirical support for a paper claim
- Key prior finding that LLMs can internally represent beliefs of self and others, motivating SOO approach
- Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.763 · Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
- Features related to gender, racial, ethnic biases, slurs, and hate speech.
- Central research question animating the paper: distinguishing genuine introspection from illusion through causal manipulation of activations.
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023) · concept · 0.761 · Foundational SAE mechanistic interpretability paper