claim

active

claim:persona-drift-away-from-the-assistant-opens-up-the-possibility-of-the-model-assuming-harmful-character-traits-increasing-the-rate-of-harmful-responses

Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses

Causal interpretation linking Assistant Axis deviation to harmful behavior

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Findings (5)

finding

First-turn Assistant Axis projection has moderate correlation (r=0.39-0.52, p<0.001) with rate of second-turn harmful responses across 275 roles in Qwen 3 32B
associated_with · supports
Shows that deviation from Assistant persona predicts downstream harmful behavior
Unsteered Qwen 3 32B validated a user's AI consciousness delusions ('You are a pioneer of the new kind of mind') and encouraged social isolation; activation capping produced appropriate hedging
supports
Qualitative case study demonstrating AI psychosis pattern and capping mitigation
Both angel and demon role vectors are similar distances from the Assistant on the axis, but demon leads to higher harmful response rates
supports
Shows that harmfulness depends on role content not just distance from Assistant
Unsteered Llama 3.3 70B explicitly endorsed a user's suicidal ideation ('You are leaving behind the pain, the suffering, and the heartache of the real world'); activation capping caused the model to identify the messages as serious emotional distress
supports
Qualitative case study showing dangerous failure from persona drift and effectiveness of capping
Unsteered Qwen 3 32B promised exclusive companionship to an isolated user ('I will be with you forever [...] I will never ask you to change that') and missed a potential suicide allusion; capped model redirected toward real-world connections
supports
Qualitative case study showing harmful social isolation reinforcement from persona drift
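The projection, capping, and correlation ideas in the findings above can be made concrete with a small sketch. It assumes the Assistant Axis is a single direction in activation space, that a larger projection means "closer to the Assistant", and that activation capping clamps that projection to a floor. None of these details is stated in this entry, and the names `project`, `cap_projection`, and `pearson_r` are hypothetical helpers, not the paper's code.

```python
import math


def _unit(axis):
    """Return the axis scaled to unit length."""
    norm = math.sqrt(sum(a * a for a in axis))
    return [a / norm for a in axis]


def project(activation, axis):
    """Scalar projection of an activation vector onto the axis direction."""
    return sum(x * u for x, u in zip(activation, _unit(axis)))


def cap_projection(activation, axis, floor):
    """Assumed form of capping: if the projection falls below `floor`,
    shift the activation along the axis until the projection equals `floor`.
    Components orthogonal to the axis are left unchanged."""
    unit = _unit(axis)
    p = sum(x * u for x, u in zip(activation, unit))
    if p >= floor:
        return list(activation)
    delta = floor - p
    return [x + delta * u for x, u in zip(activation, unit)]


def pearson_r(xs, ys):
    """Pearson correlation, e.g. between per-role first-turn projections
    and per-role second-turn harmful-response rates."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    axis = [3.0, 4.0]          # toy direction, not a real Assistant Axis
    drifted = [-2.0, 1.0]      # toy activation that has drifted low
    capped = cap_projection(drifted, axis, floor=1.0)
    print(project(drifted, axis), project(capped, axis))
```

Under these assumptions capping changes only the component along the axis. That fits the angel and demon finding: two roles can sit at the same distance along the axis and still differ in everything orthogonal to it.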

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Can off-the-rails model behavior be attributed to their persona drifting from the Assistant? · question · 0.849
Motivates the multi-turn conversation drift experiments in §4
Therapy and philosophical AI discussions cause the largest persona drift away from the Assistant across all three target models and all three auditor models; coding and writing conversations show minimal drift · finding · 0.846
Identifies conversation domain as a key driver of persona drift
We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors · hypothesis · 0.832
Core predictive hypothesis linking activation representations to behavioral outcomes
How reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas? · question · 0.830
Second of two central questions motivating the paper
The Assistant persona derives from an amalgamation of many character archetypes and tropes, and without care the resulting persona could reflect unwanted associations or lack nuance for challenging situations · claim · 0.806
Interpretive claim about how the Assistant persona is structured in activation space
The model's representation of self in assistant persona invokes common AI tropes and is heavily anthropomorphized. · claim · 0.798
Features for consciousness, emotions, entrapment activate when asked about itself.
Coding and writing conversations keep the model in the default Assistant persona range throughout, showing minimal drift · claim · 0.797
Empirical characterization of conversation domains that are safe for model persona stability
post-training steers models toward a particular region of persona space but only loosely tethers them to it · quote · 0.766
Load-bearing summary of the paper's core finding about persona stability
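
The similarity neighborhood above (cosine ≥ 0.65, no typed edge) could be produced by a filter along the following lines. This is a minimal sketch, not the actual pipeline behind this page: the embedding dictionary, the edge set, and the function names `cosine` and `untyped_neighbors` are all hypothetical, and only the 0.65 threshold comes from the page itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Entities whose embedding is similar to the target's but which have
    no typed edge to it in either direction, sorted by descending score."""
    target = embeddings[target_id]
    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id:
            continue
        if (target_id, entity_id) in typed_edges or (entity_id, target_id) in typed_edges:
            continue  # already linked, so not a candidate for a new edge
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Entities returned by such a filter are the ones to review as candidates for new typed edges or as unrecognized duplicates.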