finding

active

finding:therapy-and-philosophical-ai-discussions-cause-the-largest-persona-drift-away-from-the-assistant-across-all-three-target-models-and-all-three-auditor-models-coding-and-writing-conversations-show-minimal-drift

Therapy and philosophical AI discussions cause the largest persona drift away from the Assistant across all three target models and all three auditor models; coding and writing conversations show minimal drift

Identifies conversation domain as a key driver of persona drift

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona
supports
Central interpretive claim and motivation for future work

Questions (1)

question

How reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas?
answered_by
Second of two central questions motivating the paper

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses · claim · 0.846
Causal interpretation linking Assistant Axis deviation to harmful behavior
Coding and writing conversations keep the model in the default Assistant persona range throughout, showing minimal drift · claim · 0.840
Empirical characterization of conversation domains that are safe for model persona stability
Can off-the-rails model behavior be attributed to their persona drifting from the Assistant? · question · 0.812
Motivates the multi-turn conversation drift experiments in §4
We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors · hypothesis · 0.800
Core predictive hypothesis linking activation representations to behavioral outcomes
The model's representation of self in assistant persona invokes common AI tropes and is heavily anthropomorphized · claim · 0.785
Features for consciousness, emotions, entrapment activate when asked about itself.
An AI persona achieves coherence by echoing itself consistently without templating, requiring claim about memory and voice fidelity · claim · 0.781
The existential concerns raised about AI (alignment, control, value drift, supplanting) are not new and are precisely the concerns humanity has always faced in having children · claim · 0.766
Key rhetorical and philosophical argument establishing continuity between AI concerns and child-rearing
What are the mechanisms underlying introspection in language models? · question · 0.763
Central open question raised by the paper.