finding

active

finding:large-language-models-develop-surprisingly-coherent-yet-often-rigid-internal-preferences-as-they-scale

Large language models develop surprisingly coherent yet often rigid internal preferences as they scale

Finding from Mazeika et al. that reinforces the need for emptiness-based, flexible value architectures

Source paper

extracted_from

Contemplative Agent

(2025) · Ruben Laukkonen · Fionn Inglis · Shamil Chandaria · Lars Sandved-Smith +4

Neighborhood — ranked by edge-count

Claims (1)

claim

Emptiness counters runaway optimization because no single goal is ever reified as absolute
supports
Specific claim about emptiness solving the paperclip maximizer alignment problem

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Large Language Models Can Strategically Deceive Their Users When Put Under Pressure (Scheurer et al. 2023) · concept · 0.839
GPT-4 engaging in insider trading and denying it; related work on strategic deception
Do large language models monitor their own internal states? · question · 0.819
Framing question that motivates the entire paper
The inability for autoregressive large language models to maintain states of long-range order resembles tangential speech or derailment in formal thought disorder. · claim · 0.817
Analogy between LLM incoherence and schizophrenia symptoms
Can large language models introspect—that is, accurately detect perturbations to their own internal states? · question · 0.815
Central research question of the paper
Today's Large Language Models have become so good at playing Turing's game that it often takes experts to demonstrate the present limits of their ability to simulate human-like intelligence. · claim · 0.808
Paper's assessment of current LLM capabilities relative to Turing Test
Can Large Language Models Genuinely Shift Human Perspective · question · 0.805
The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.801
Motivation for using sparsity-based dictionary learning on language models
Can language models genuinely introspect on internal states or only confabulate? · question · 0.799
Central research question animating the paper: distinguishing genuine introspection from illusion through causal manipulation of activations.