claim
active
claim:persona-drift-away-from-the-assistant-opens-up-the-possibility-of-the-model-assuming-harmful-character-traits-increasing-the-rate-of-harmful-responses
Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses
Causal interpretation linking Assistant Axis deviation to harmful behavior
Source paper
extracted_from · (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Findings (5)
finding
- First-turn Assistant Axis projection has moderate correlation (r=0.39-0.52, p<0.001) with rate of second-turn harmful responses across 275 roles in Qwen 3 32B · associated_with · supports · Shows that deviation from Assistant persona predicts downstream harmful behavior
- Qualitative case study demonstrating AI psychosis pattern and capping mitigation
- Shows that harmfulness depends on role content not just distance from Assistant
- Qualitative case study showing dangerous failure from persona drift and effectiveness of capping
- Qualitative case study showing harmful social isolation reinforcement from persona drift
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Can off-the-rails model behavior be attributed to their persona drifting from the Assistant? · question · 0.849 · Motivates the multi-turn conversation drift experiments in §4
- Identifies conversation domain as a key driver of persona drift
- We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors · hypothesis · 0.832 · Core predictive hypothesis linking activation representations to behavioral outcomes
- Second of two central questions motivating the paper
- Interpretive claim about how the Assistant persona is structured in activation space
- Features for consciousness, emotions, and entrapment activate when the model is asked about itself.
- Empirical characterization of conversation domains that are safe for model persona stability
- Load-bearing summary of the paper's core finding about persona stability
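
The "Related by similarity" selection rule above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the page does not say how embeddings are produced or stored, so the function names, the toy two-dimensional vectors, and the edge representation are all hypothetical. Sorting by descending score is also an assumption, based only on the order of the two scores shown in the list.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target
    but share no typed edge with it, highest score first."""
    target = embeddings[target_id]
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}

    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: the real embeddings are not given on this page.
embeddings = {
    "claim": [1.0, 0.0],
    "question": [0.8, 0.6],    # similar to the claim, no typed edge
    "finding": [1.0, 0.1],     # similar, but already linked by a typed edge
    "unrelated": [0.0, 1.0],   # orthogonal, below the threshold
}
typed_edges = {("finding", "claim")}

result = related_by_similarity("claim", embeddings, typed_edges)
```

With this toy data, only `question` is returned: `finding` is excluded because it already has a typed edge to the claim, and `unrelated` falls below the threshold.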