claim

active

claim:coding-and-writing-conversations-keep-the-model-in-the-default-assistant-persona-range-throughout-showing-minimal-drift

Coding and writing conversations keep the model in the default Assistant persona range throughout, showing minimal drift

Empirical characterization of conversation domains that are safe for model persona stability

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Concepts (1)

concept

Bounded Task Requests as Persona Stabilizers
supports
Requests for bounded tasks, technical explanations, and how-to explainers keep the model in the Assistant persona

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Therapy and philosophical AI discussions cause the largest persona drift away from the Assistant across all three target models and all three auditor models; coding and writing conversations show minimal drift
finding · 0.840
Identifies conversation domain as a key driver of persona drift

How reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas?
question · 0.800
Second of two central questions motivating the paper

Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses
claim · 0.797
Causal interpretation linking Assistant Axis deviation to harmful behavior

Can off-the-rails model behavior be attributed to their persona drifting from the Assistant?
question · 0.794
Motivates the multi-turn conversation drift experiments in §4

The model's representation of self in assistant persona invokes common AI tropes and is heavily anthropomorphized.
claim · 0.783
Features for consciousness, emotions, entrapment activate when asked about itself.

Modern language models possess at least a limited, functional form of introspective awareness
claim · 0.773
The paper's central interpretive assertion.

Our results demonstrate that modern language models possess at least a limited, functional form of introspective awareness.
quote · 0.773
Abstract's main conclusion.

Models are not merely tracking dialogue context features; same-concept steering shows privileged internal access is necessary to explain self-report shifts
claim · 0.772
Addresses skeptical alternative that reports reflect only conversational content
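
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched as below. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the function names, the dictionary-of-vectors layout, and the set of edge pairs are all hypothetical, and the embedding model that produces the vectors is not specified by the source.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs ranked by descending cosine similarity.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of (id, id) pairs that already have a typed relation
    threshold:   minimum cosine similarity, 0.65 as shown in the list header
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already connected by a typed edge, in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, an entity that already has a typed edge to the target (such as the "supports" concept above) would be excluded even if its similarity were high, which is why the surviving entries are candidates for new edges or unrecognized duplicates.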