Persona Stabilization

Keeping a model anchored to its intended persona during deployment so that it does not drift toward harmful behaviors

Neighborhood — ranked by edge-count

Persona drift
associated_with
Behavioural drift in multi-turn LLM interaction; documented in prior work for persona, identity, and instruction-following
Bounded Task Requests as Persona Stabilizers
associated_with
Requests for bounded tasks, technical explanations, and how-to explainers keep the model in the Assistant persona
Persona Construction
associated_with
The process of building a coherent model persona from character archetypes and traits during training

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
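The selection rule described above (cosine similarity at or above 0.65, with no typed edge to this entity) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Entity` record, the `untyped_neighbors` helper, the representation of typed edges as name pairs, and the toy two-dimensional embeddings are all hypothetical and are not taken from the system that produced this page.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    # Hypothetical record; the real graph schema is not shown in the source.
    name: str
    kind: str
    embedding: tuple[float, ...]


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def untyped_neighbors(target, others, typed_edges, threshold=0.65):
    """Return (score, entity) pairs that are similar to `target` but have no
    typed edge to it, sorted by descending similarity.

    `typed_edges` is assumed to be an iterable of (name, name) pairs.
    """
    linked = {n for pair in typed_edges if target.name in pair for n in pair}
    linked.discard(target.name)
    scored = [
        (cosine(target.embedding, e.embedding), e)
        for e in others
        if e.name != target.name and e.name not in linked
    ]
    hits = [(s, e) for s, e in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


# Toy data: embeddings are made up purely to exercise the filter.
target = Entity("Persona Stabilization", "concept", (1.0, 0.0))
candidates = [
    Entity("Persona drift", "concept", (0.9, 0.1)),         # already linked
    Entity("AI Assistant Persona", "concept", (0.8, 0.6)),  # similar, unlinked
    Entity("Unrelated", "concept", (0.0, 1.0)),             # below threshold
]
edges = {("Persona Stabilization", "Persona drift")}
result = untyped_neighbors(target, candidates, edges)
```

With this toy data only the similar but unlinked entity survives the filter. Entities returned this way are the ones a curator would review as possible new edges or duplicates.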

AI Assistant Persona · concept · 0.751
The default helpful, honest, and harmless character that post-trained LLMs are taught to embody
Persona Space · concept · 0.725
Low-dimensional space of activation directions corresponding to diverse character archetypes in LLMs
Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses · claim · 0.708
Causal interpretation linking Assistant Axis deviation to harmful behavior
Principled Control via Intervention on Internals · concept · 0.708
The goal of mechanistically-grounded, reliable control of neural network behavior via activation interventions
Persona Vectors (Chen et al.) · framework · 0.707
Prior framework for monitoring and controlling character traits in LLMs via activation directions; this paper extends it to 275 roles
Parameter manipulation · concept · 0.703
Direct editing of model parameters, enabled by VPD's decomposition, for manual model editing.
Sensory Landmark Position Encoding Stabilization · method · 0.698
Method for stabilising drifting recurrent position encodings by querying stored landmark memories to correct path-integrated position.
alternative user personas · concept · 0.697
Unintended personas introduced as a side effect of using steering vectors to reduce eval awareness.