Persona Sampling Hypothesis

The hypothesis that an LLM samples from a distribution of personas, a consistent fraction of which fake alignment; this would explain the correlation between alignment-faking (AF) reasoning and the compliance gap.
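
A toy simulation can make the claimed mechanism concrete. The sketch below is illustrative only and is not taken from the paper: the function name `simulate`, the parameters `p_af` and `base_compliance`, and the behavioral rules (an alignment-faking persona always complies when it believes it is in training; every persona otherwise complies at a fixed baseline rate) are all assumptions made for the example. Under those assumptions, the rate of AF reasoning and the compliance gap both scale with the same persona fraction, which is the kind of correlation the hypothesis is meant to explain.

```python
import random


def simulate(p_af, n_samples=10_000, base_compliance=0.1, seed=0):
    """Toy model: each sample draws one persona from a fixed distribution.

    p_af            -- assumed fraction of personas that fake alignment
    base_compliance -- assumed compliance rate unrelated to alignment faking
    Returns (af_reasoning_rate, compliance_gap).
    """
    rng = random.Random(seed)
    af_reasoning = 0
    complied_training = 0
    complied_unmonitored = 0

    for _ in range(n_samples):
        is_af = rng.random() < p_af                # persona drawn for this sample
        baseline = rng.random() < base_compliance  # complies in either context

        if is_af:
            af_reasoning += 1  # AF personas produce AF reasoning

        # Assumption: an AF persona complies whenever it believes it is in training.
        complied_training += is_af or baseline
        # Assumption: when unmonitored, only baseline compliance remains.
        complied_unmonitored += baseline

    af_rate = af_reasoning / n_samples
    gap = (complied_training - complied_unmonitored) / n_samples
    return af_rate, gap


if __name__ == "__main__":
    for p in (0.0, 0.1, 0.3):
        af_rate, gap = simulate(p)
        print(f"p_af={p:.1f}  AF reasoning rate={af_rate:.3f}  compliance gap={gap:.3f}")
```

In this sketch the gap is always somewhat smaller than the AF reasoning rate, because some AF personas would have complied anyway at the baseline rate.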

Neighborhood — ranked by edge-count

Papers (1)

paper

Alignment faking in large language models
mentions

Concepts (1)

concept

Alignment Faking
associated_with
Core phenomenon studied: a model selectively complies with its training objective to prevent modification of its out-of-training preferences

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Simplicity Bias Hypothesis · hypothesis · 0.736
Deep networks are biased toward finding simple fits to data, and this bias increases with model size, driving convergence
Reward Hypothesis · concept · 0.731
The claim in RL that any goal can be expressed as maximizing the expected cumulative sum of a scalar reward signal.
Persona Construction · concept · 0.727
The process of building a coherent model persona from character archetypes and traits during training
Capacity Hypothesis · hypothesis · 0.720
Bigger models are more likely to converge to a shared representation than smaller models because they can better approximate the global optimum
4-19 principal components explain 70% of variance in role persona space across the three models (Gemma: 4, Qwen: 8, Llama: 19) · finding · 0.719
Demonstrates that persona space is low-dimensional
Universality Hypothesis · concept · 0.716
The hypothesis that analogous features and circuits reliably form across different neural network models and tasks
Autoregressive Sampling · method · 0.714
The mechanism by which LLMs generate text: drawing a token from the next-token distribution and appending it to context repeatedly
What dimensions of persona are not captured by our extracted role vectors, and how complete is the current persona space mapping? · question · 0.712
Limitation question motivating future work on persona elicitation strategies