Reynolds and McDonell 2021: Multiversal views on language models

Prior work by two co-authors introducing the multiverse framing for LLMs; cited as ref 18

Neighborhood — ranked by edge-count

thinker

Kyle McDonell
authored
Co-author; co-developed the theoretical framework with equal contribution
Laria Reynolds
authored
Co-author; co-developed the theoretical framework with equal contribution

concept

Multiverse of Possible Characters
introduces
The branching tree of possible token continuations, each branch representing a distinct narrative path or character

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Language models are few-shot learners (Brown et al., 2020) · concept · 0.772
Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.
Perez et al. 2022: Discovering language model behaviors with model-written evaluations · concept · 0.767
Study showing RLHF can exacerbate self-preservation tendencies in LLMs; key empirical support for a paper claim
Zhu et al. 2024 - Language models represent beliefs of self and others · concept · 0.764
Key prior finding that LLMs can internally represent beliefs of self and others, motivating SOO approach
Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.763
Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
Bias in language models · concept · 0.762
Features related to gender, racial, ethnic biases, slurs, and hate speech.
Can language models genuinely introspect on internal states or only confabulate? · question · 0.761
Central research question animating the paper: distinguishing genuine introspection from illusion through causal manipulation of activations.
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023) · concept · 0.761
Foundational SAE mechanistic interpretability paper
Can Large Language Models Genuinely Shift Human Perspective · question · 0.759