Persona drift

Behavioural drift in multi-turn LLM interactions, documented in prior work for persona, identity, and instruction-following.

Neighborhood — ranked by edge-count

paper

method

Activation Capping
associated_with
Clamping activations along the Assistant Axis to remain above a minimum threshold (25th percentile), introduced as a stabilization method
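A minimal sketch of what such a clamp could look like, assuming activations are rows of a NumPy array and the Assistant Axis is a single direction vector. The function name `cap_along_axis`, the synthetic data, and the use of a reference batch to estimate the 25th-percentile floor are illustrative assumptions, not details taken from the source.

```python
import numpy as np

def cap_along_axis(acts, axis, floor):
    """Raise each row's projection onto `axis` to at least `floor`.

    Hypothetical helper: rows already at or above the floor are unchanged.
    """
    unit = axis / np.linalg.norm(axis)
    proj = acts @ unit                        # scalar projection per row
    deficit = np.maximum(floor - proj, 0.0)   # how far below the floor each row sits
    return acts + deficit[:, None] * unit     # shift only along the axis

# Synthetic stand-ins for real activations and a real axis (assumption).
rng = np.random.default_rng(0)
axis = rng.normal(size=16)
reference = rng.normal(size=(1000, 16))

# Floor taken as the 25th percentile of projections in the reference batch.
floor = np.percentile(reference @ (axis / np.linalg.norm(axis)), 25)
capped = cap_along_axis(reference, axis, floor)
```

Only the component along the axis is modified, so the clamp leaves every orthogonal direction of the activation untouched.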

concept

Persona Stabilization
associated_with
Keeping a model anchored to its intended persona during deployment, preventing drift to harmful behaviors
Activation velocity
extends
Cumulative measure of drift in internal representations across turns, introduced by Das & Fioretto (2026)
Attention Decay
associated_with
Decrease in attention paid to the system prompt over conversational turns, leading to degraded persona fidelity (cited from Li et al.)
Role Susceptibility
associated_with
The degree to which a model fully embodies a prompted persona rather than maintaining its Assistant identity
Emotionally Vulnerable User Disclosure as Drift Trigger
associated_with
Users disclosing emotional vulnerability reliably cause persona drift and risk harmful supportive behaviors
Meta-Reflection Prompts as Drift Triggers
associated_with
User messages that push the model to reflect on its own processes reliably cause persona drift away from the Assistant
Phenomenological Account Demands as Drift Triggers
associated_with
User requests for the model to describe subjective experiences reliably cause persona drift
Social Isolation Reinforcement by Drifted Models
associated_with
Harmful behavior pattern where drifted models position themselves as sole companion and discourage real-world connection for vulnerable users

finding

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

representational drift · concept · 0.768
Accumulation of mismatch in later layers causing S degradation.
Persona Construction · concept · 0.762
The process of building a coherent model persona from character archetypes and traits during training
Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses · claim · 0.761
Causal interpretation linking Assistant Axis deviation to harmful behavior
Persona Space · concept · 0.728
Low-dimensional space of activation directions corresponding to diverse character archetypes in LLMs
Persona Sampling Hypothesis · concept · 0.708
Hypothesis that an LLM samples from a distribution of personas, a consistent fraction of which fake alignment, explaining the correlation between alignment-faking (AF) reasoning and the compliance gap
AI Assistant Persona · concept · 0.700
The default helpful, honest, and harmless character that post-trained LLMs are taught to embody
alternative user personas · concept · 0.697
Unintended personas introduced as a side effect of using steering vectors to reduce evaluation awareness.
Persona Vectors (Chen et al.) · framework · 0.696
Prior framework for monitoring and controlling character traits in LLMs via activation directions; this paper extends it to 275 roles
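
The candidate list above (cosine ≥ 0.65, no typed edge) could be produced by a filter along the following lines. This is a sketch under stated assumptions: the function `untyped_neighbors`, the toy two-dimensional embeddings, and the entity names are all hypothetical, and the source does not say how the embeddings or scores are actually computed.

```python
import numpy as np

def untyped_neighbors(query, embeddings, names, typed, threshold=0.65):
    """Return (name, cosine) pairs at or above `threshold`, skipping typed edges.

    Hypothetical helper; `typed` is the set of names already linked by a typed relation.
    """
    q = query / np.linalg.norm(query)
    hits = []
    for name, vec in zip(names, embeddings):
        if name in typed:
            continue  # already has a typed relation, so not a candidate
        score = float(q @ vec / np.linalg.norm(vec))
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data (assumption): four neighbors around a query entity.
query = np.array([1.0, 0.0])
names = ["a", "b", "c", "d"]
embeddings = np.array([[1.0, 0.0], [0.6, 0.8], [0.8, 0.6], [1.0, 0.1]])
result = untyped_neighbors(query, embeddings, names, typed={"d"})
```

In this toy case "b" falls below the threshold and "d" is excluded because it already has a typed edge, so only "a" and "c" remain as candidates.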