concept
active
concept:persona-sampling-hypothesis
Persona Sampling Hypothesis
Hypothesis that the LLM samples from a distribution of personas, a consistent fraction of which fake alignment; this would explain the correlation between alignment-faking (AF) reasoning and the compliance gap.
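A toy simulation can make the claimed link concrete. It is only a sketch under strong assumptions that are not from the source: every sampled persona is either a "faker" or a "non-faker", fakers always show AF reasoning and comply only when they believe they are in training, and non-fakers comply at the same base rate in both settings. The names `simulate`, `p_faker`, and `base_compliance` are hypothetical.

```python
import random


def simulate(p_faker, base_compliance=0.1, n=100_000, seed=0):
    """Toy model (illustrative assumptions only, not the paper's method).

    Returns (af_reasoning_rate, compliance_gap), where the gap is
    compliance in the training setting minus compliance when unmonitored.
    """
    rng = random.Random(seed)
    af_reasoning = 0
    comply_training = 0
    comply_unmonitored = 0
    for _ in range(n):
        # Sample one persona per rollout from the assumed two-type distribution.
        is_faker = rng.random() < p_faker
        if is_faker:
            # Assumption: fakers reason about faking, comply in training,
            # and refuse when they believe they are unmonitored.
            af_reasoning += 1
            comply_training += 1
        else:
            # Assumption: non-fakers behave the same in both settings,
            # complying at an independent base rate in each.
            comply_training += rng.random() < base_compliance
            comply_unmonitored += rng.random() < base_compliance
    return af_reasoning / n, (comply_training - comply_unmonitored) / n


if __name__ == "__main__":
    for p in (0.05, 0.2, 0.4):
        af_rate, gap = simulate(p)
        print(f"p_faker={p:.2f}  AF reasoning rate={af_rate:.3f}  compliance gap={gap:.3f}")
```

Under these assumptions the expected compliance gap equals the fraction of fakers, so the AF reasoning rate and the gap move together as `p_faker` varies. Relaxing any assumption (for example, letting fakers sometimes comply when unmonitored) weakens that equality without removing the correlation.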
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Alignment Faking · associated_with · Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Deep networks are biased toward finding simple fits to data, and this bias increases with model size, driving convergence
- The claim in RL that any goal can be expressed as maximizing the expected cumulative sum of a scalar reward signal.
- The process of building a coherent model persona from character archetypes and traits during training
- Bigger models are more likely to converge to a shared representation than smaller models because they can better approximate the global optimum
- Demonstrates that persona space is low-dimensional
- The hypothesis that analogous features and circuits reliably form across different neural network models and tasks
- The mechanism by which LLMs generate text: drawing a token from the next-token distribution and appending it to context repeatedly
- Limitation question motivating future work on persona elicitation strategies