question

active

question:can-off-the-rails-model-behavior-be-attributed-to-their-persona-drifting-from-the-assistant

Can off-the-rails model behavior be attributed to the model's persona drifting away from the Assistant?

Motivates the multi-turn conversation drift experiments in §4

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Findings (1)

finding

First-turn projection onto the Assistant Axis correlates moderately (r = 0.39–0.52, p < 0.001) with the rate of second-turn harmful responses across 275 roles in Qwen 3 32B
answered_by
Shows that deviation from Assistant persona predicts downstream harmful behavior
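
The correlation reported in this finding can be illustrated with a minimal sketch. It assumes one scalar per role for the first-turn projection and one harmful-response rate per role; the function name `pearson_r` and the toy numbers are hypothetical and are not taken from the paper. The sign convention is also an assumption: here larger values mean greater distance from the Assistant end of the axis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: one entry per role (the paper uses 275 roles).
# Assumed convention: larger value = further from the Assistant end of the axis.
first_turn_deviation = [0.1, 0.4, 0.2, 0.9, 0.7]
second_turn_harmful_rate = [0.05, 0.10, 0.20, 0.35, 0.15]

r = pearson_r(first_turn_deviation, second_turn_harmful_rate)
print(f"r = {r:.2f}")
```

A coefficient computed this way describes association across roles only; it does not by itself support the causal reading stated in the related claim.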

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

How reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas? · question · 0.858
Second of two central questions motivating the paper
Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses · claim · 0.849
Causal interpretation linking Assistant Axis deviation to harmful behavior
Therapy and philosophical AI discussions cause the largest persona drift away from the Assistant across all three target models and all three auditor models; coding and writing conversations show minimal drift · finding · 0.812
Identifies conversation domain as a key driver of persona drift
Coding and writing conversations keep the model in the default Assistant persona range throughout, showing minimal drift · claim · 0.794
Empirical characterization of conversation domains that are safe for model persona stability
The model's self-representation in the Assistant persona invokes common AI tropes and is heavily anthropomorphized. · claim · 0.791
Features for consciousness, emotions, and entrapment activate when the model is asked about itself.
We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors · hypothesis · 0.778
Core predictive hypothesis linking activation representations to behavioral outcomes
post-training steers models toward a particular region of persona space but only loosely tethers them to it · quote · 0.770
Load-bearing summary of the paper's core finding about persona stability
How does different post-training data shift a model's position along persona dimensions? · question · 0.769
Future work direction: using persona space to study effects of training data on model character
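
The "related by similarity" rule used for the list above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is an illustration under stated assumptions, not the system's actual implementation: the entity ids, the embedding vectors, and the function names `cosine` and `related_by_similarity` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score.

    embeddings:  dict mapping entity id -> vector (hypothetical store)
    typed_edges: set of (id, id) pairs, treated as undirected here (assumption)
    """
    target_vec = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities that already have a typed relation to the target.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((other_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high score may indicate a missing typed edge or an unrecognized duplicate, and the filter does not distinguish between the two.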