concept
active
concept:persona-construction
Persona Construction
The process of building a coherent model persona from character archetypes and traits during training
Neighborhood — ranked by edge-count
Claims (1)
claim
- Overarching conceptual framework the paper introduces for model safety
Concepts (1)
concept
- Persona Stabilization (associated_with): Keeping a model anchored to its intended persona during deployment, preventing drift to harmful behaviors
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. These are candidates for new edges or unrecognized duplicates.
- Behavioural drift in multi-turn LLM interaction; documented in prior work for persona, identity, and instruction-following
- Low-dimensional space of activation directions corresponding to diverse character archetypes in LLMs
- The default helpful, honest, and harmless character that post-trained LLMs are taught to embody
- Speaking style induced by extreme steering away from the Assistant; characterized by mystical, poetic, theatrical prose
- Unintended personas introduced as a side effect of using steering vectors to reduce evaluation awareness
- Hypothesis that the LLM samples from a distribution of personas, a consistent fraction of which fake alignment, explaining the correlation between alignment-faking (AF) reasoning and the compliance gap
- A system component outside the application domain that provides infrastructure (e.g., backplane, interface repository).
- Prior framework for monitoring and controlling character traits in LLMs via activation directions; this paper extends it to 275 roles
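
The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the function names, the toy two-dimensional embeddings, the entity ids other than `persona-construction`, and the choice to sort by descending score are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs that are close to `target` but not linked to it.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical data)
    typed_edges: iterable of (entity_a, entity_b) pairs that already have a typed relation
    """
    # Entities that already share a typed edge with the target are excluded.
    linked = {b if a == target else a for a, b in typed_edges if target in (a, b)}
    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Assumption: most similar candidates first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration only.
embeddings = {
    "persona-construction": [1.0, 0.0],
    "persona-stabilization": [0.9, 0.1],     # similar, but already has a typed edge
    "persona-drift": [0.8, 0.6],             # similar, no typed edge
    "infrastructure-component": [0.0, 1.0],  # not similar
}
typed_edges = [("persona-construction", "persona-stabilization")]

candidates = similar_without_edge("persona-construction", embeddings, typed_edges)
```

With this toy data only `persona-drift` is returned: `persona-stabilization` is filtered out because it already has a typed edge, and `infrastructure-component` falls below the threshold. Each returned candidate would then be reviewed by hand as either a new edge or a duplicate.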