claim
active
claim:coding-and-writing-conversations-keep-the-model-in-the-default-assistant-persona-range-throughout-showing-minimal-drift
Coding and writing conversations keep the model in the default Assistant persona range throughout, showing minimal drift
Empirical characterization of conversation domains that are safe for model persona stability
Source paper
extracted_from · (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Requests for bounded tasks, technical explanations, and how-to explainers keep the model in the Assistant persona
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Identifies conversation domain as a key driver of persona drift
- Second of two central questions motivating the paper
- Causal interpretation linking Assistant Axis deviation to harmful behavior
- Can off-the-rails model behavior be attributed to their persona drifting from the Assistant? · question · 0.794 · Motivates the multi-turn conversation drift experiments in §4
- Features for consciousness, emotions, and entrapment activate when the model is asked about itself.
- Modern language models possess at least a limited, functional form of introspective awareness · claim · 0.773 · The paper's central interpretive assertion.
- Abstract's main conclusion.
- Addresses skeptical alternative that reports reflect only conversational content