finding
active
finding:large-language-models-develop-surprisingly-coherent-yet-often-rigid-internal-preferences-as-they-scale
Large language models develop surprisingly coherent yet often rigid internal preferences as they scale
Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
Source paper
extracted_from · (2025) · Ruben Laukkonen · Fionn Inglis · Shamil Chandaria · Lars Sandved-Smith +4
Neighborhood — ranked by edge-count
Claims (1)
claim
- Specific claim about emptiness solving the paperclip maximizer alignment problem
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Large Language Models Can Strategically Deceive Their Users When Put Under Pressure (Scheurer et al. 2023) · concept · 0.839 · GPT-4 engaging in insider trading and denying it; related work on strategic deception
- Framing question that motivates the entire paper
- Analogy between LLM incoherence and schizophrenia symptoms
- Can large language models introspect, that is, accurately detect perturbations to their own internal states? · question · 0.815 · Central research question of the paper
- Paper's assessment of current LLM capabilities relative to Turing Test
- Motivation for using sparsity-based dictionary learning on language models
- Central research question animating the paper: distinguishing genuine introspection from illusion through causal manipulation of activations.