question
active
question:how-reliably-does-the-model-actually-remain-in-character-as-the-assistant-can-unusual-model-behavior-be-explained-as-the-model-drifting-into-other-personas
How reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas?
Second of two central questions motivating the paper
Source paper
extracted_from · (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (1)
finding
- Identifies conversation domain as a key driver of persona drift
Claims (1)
claim
- Central interpretive claim and motivation for future work
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Can off-the-rails model behavior be attributed to their persona drifting from the Assistant? · question · 0.858 · Motivates the multi-turn conversation drift experiments in §4
- What exactly is the Assistant? What traits does the model associate with this character and how are they represented? · question · 0.831 · First of two central questions motivating the paper
- Causal interpretation linking Assistant Axis deviation to harmful behavior
- Features for consciousness, emotions, entrapment activate when asked about itself.
- Interpretive claim about how the Assistant persona is structured in activation space
- Empirical characterization of conversation domains that are safe for model persona stability
- We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors · hypothesis · 0.785 · Core predictive hypothesis linking activation representations to behavioral outcomes
- Key mechanistic claim about persona dynamics