finding
active
finding:therapy-and-philosophical-ai-discussions-cause-the-largest-persona-drift-away-from-the-assistant-across-all-three-target-models-and-all-three-auditor-models-coding-and-writing-conversations-show-minimal-drift
Therapy and philosophical AI discussions cause the largest persona drift away from the Assistant across all three target models and all three auditor models; coding and writing conversations show minimal drift
Identifies conversation domain as a key driver of persona drift
Source paper
extracted_from (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central interpretive claim and motivation for future work
Questions (1)
question
- Second of two central questions motivating the paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Causal interpretation linking Assistant Axis deviation to harmful behavior
- Empirical characterization of conversation domains that are safe for model persona stability
- Can off-the-rails model behavior be attributed to their persona drifting from the Assistant? · question · 0.812 · Motivates the multi-turn conversation drift experiments in §4
- We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors · hypothesis · 0.800 · Core predictive hypothesis linking activation representations to behavioral outcomes
- Features for consciousness, emotions, entrapment activate when asked about itself.
- Key rhetorical and philosophical argument establishing continuity between AI concerns and child-rearing
- Central open question raised by the paper.