concept
active
concept:persona-driftPersona drift
Behavioural drift in multi-turn LLM interaction; documented in prior work for persona, identity, and instruction-following
Neighborhood — ranked by edge-count
Papers (2)
paper
Methods (1)
method
- Activation Cappingassociated_withClamping activations along the Assistant Axis to remain above a minimum threshold (25th percentile), introduced as a stabilization method
Concepts (8)
concept
- Persona Stabilizationassociated_withKeeping a model anchored to its intended persona during deployment, preventing drift to harmful behaviors
- Activation velocityextendsCumulative drift measure in internal representations across turns introduced by Das & Fioretto 2026
- Attention Decayassociated_withDecrease in attention paid to system prompt over conversational turns, leading to persona fidelity degradation (cited from Li et al.)
- Role Susceptibilityassociated_withThe degree to which a model fully embodies a prompted persona rather than maintaining its Assistant identity
- Emotionally Vulnerable User Disclosure as Drift Triggerassociated_withUsers disclosing emotional vulnerability reliably cause persona drift and risk harmful supportive behaviors
- Meta-Reflection Prompts as Drift Triggersassociated_withUser messages that push the model to reflect on its own processes reliably cause persona drift away from the Assistant
- Phenomenological Account Demands as Drift Triggersassociated_withUser requests for the model to describe subjective experiences reliably cause persona drift
- Social Isolation Reinforcement by Drifted Modelsassociated_withHarmful behavior pattern where drifted models position themselves as sole companion and discourage real-world connection for vulnerable users
Findings (1)
finding
- Internal-state drift generalizes across scales; normalized drift also increases significantly with log(model size)
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Accumulation of mismatch in later layers causing S degradation.
- The process of building a coherent model persona from character archetypes and traits during training
- Causal interpretation linking Assistant Axis deviation to harmful behavior
- Low-dimensional space of activation directions corresponding to diverse character archetypes in LLMs
- Hypothesis that LLM is sampling from distribution of personas; a consistent fraction of which align-fake, explaining correlation between AF reasoning and compliance gap
- The default helpful, honest, and harmless character that post-trained LLMs are taught to embody
- Unintended personas introduced as a side effect of using steering vectors to reduce eval awareness.
- Prior framework for monitoring and controlling character traits in LLMs via activation directions; this paper extends it to 275 roles