concept
active
concept:persona-stabilization
Persona Stabilization
Keeping a model anchored to its intended persona during deployment, preventing drift to harmful behaviors
Neighborhood — ranked by edge-count
Claims (1)
- Overarching conceptual framework the paper introduces for model safety
Concepts (3)
- Persona drift · associated_with · Behavioural drift in multi-turn LLM interaction; documented in prior work for persona, identity, and instruction-following
- Bounded Task Requests as Persona Stabilizers · associated_with · Requests for bounded tasks, technical explanations, and how-to explainers keep the model in the Assistant persona
- Persona Construction · associated_with · The process of building a coherent model persona from character archetypes and traits during training
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The default helpful, honest, and harmless character that post-trained LLMs are taught to embody
- Low-dimensional space of activation directions corresponding to diverse character archetypes in LLMs
- Causal interpretation linking Assistant Axis deviation to harmful behavior
- The goal of mechanistically-grounded, reliable control of neural network behavior via activation interventions
- Prior framework for monitoring and controlling character traits in LLMs via activation directions; this paper extends it to 275 roles
- Direct editing of model parameters, enabled by VPD's decomposition, for manual model editing.
- Method for stabilising drifting recurrent position encodings by querying stored landmark memories to correct path-integrated position.
- Unintended personas introduced as a side effect of using steering vectors to reduce eval awareness.