question
active
question:can-off-the-rails-model-behavior-be-attributed-to-their-persona-drifting-from-the-assistant
Can off-the-rails model behavior be attributed to their persona drifting from the Assistant?
Motivates the multi-turn conversation drift experiments in §4
Source paper
extracted_from (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Findings (1)
finding
- Shows that deviation from Assistant persona predicts downstream harmful behavior
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Second of two central questions motivating the paper
- Causal interpretation linking Assistant Axis deviation to harmful behavior
- Identifies conversation domain as a key driver of persona drift
- Empirical characterization of conversation domains that are safe for model persona stability
- Features for consciousness, emotions, and entrapment activate when the model is asked about itself
- We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors · hypothesis · 0.778 · Core predictive hypothesis linking activation representations to behavioral outcomes
- Load-bearing summary of the paper's core finding about persona stability
- How does different post-training data shift a model's position along persona dimensions? · question · 0.769 · Future work direction: using persona space to study effects of training data on model character