question

active

question:how-reliably-does-the-model-actually-remain-in-character-as-the-assistant-can-unusual-model-behavior-be-explained-as-the-model-drifting-into-other-personas

How reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas?

Second of two central questions motivating the paper

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Papers (1)

paper

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
associated_with

Findings (1)

finding

Therapy and philosophical AI discussions cause the largest persona drift away from the Assistant across all three target models and all three auditor models; coding and writing conversations show minimal drift
answered_by
Identifies conversation domain as a key driver of persona drift

Claims (1)

claim

Post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona
gates
Central interpretive claim and motivation for future work

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Can off-the-rails model behavior be attributed to their persona drifting from the Assistant? · question · 0.858
Motivates the multi-turn conversation drift experiments in §4
What exactly is the Assistant? What traits does the model associate with this character and how are they represented? · question · 0.831
First of two central questions motivating the paper
Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses · claim · 0.830
Causal interpretation linking Assistant Axis deviation to harmful behavior
The model's representation of self in assistant persona invokes common AI tropes and is heavily anthropomorphized. · claim · 0.817
Features for consciousness, emotions, entrapment activate when asked about itself.
The Assistant persona derives from an amalgamation of many character archetypes and tropes, and without care the resulting persona could reflect unwanted associations or lack nuance for challenging situations · claim · 0.815
Interpretive claim about how the Assistant persona is structured in activation space
Coding and writing conversations keep the model in the default Assistant persona range throughout, showing minimal drift · claim · 0.800
Empirical characterization of conversation domains that are safe for model persona stability
We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors · hypothesis · 0.785
Core predictive hypothesis linking activation representations to behavioral outcomes
The model's position along the Assistant Axis depends most strongly on the most recent user message rather than where it was previously in the conversation · claim · 0.774
Key mechanistic claim about persona dynamics
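
The "cosine ≥ 0.65 · no typed edge" rule behind this list can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as unordered pairs of entity ids. The names `cosine`, `related_by_similarity`, `embeddings`, and `typed_edges` are hypothetical and chosen for illustration only; how this page actually computes its neighborhood is not stated.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """List (entity_id, score) pairs similar to the target but not linked to it.

    embeddings:  dict mapping entity id -> list of floats (hypothetical storage)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already share a typed edge
    threshold:   minimum cosine similarity; 0.65 mirrors the cutoff shown on the page
    """
    target = embeddings[target_id]
    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already connected by a typed relation
        score = cosine(target, vector)
        if score >= threshold:
            results.append((entity_id, score))
    # Highest similarity first, matching the descending order of the list above
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, an entity appears in the list only if it clears the threshold and has no typed edge to the target, which is why the results are candidates for new edges or unrecognized duplicates.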