claim

active

claim:the-assistant-persona-derives-from-an-amalgamation-of-many-character-archetypes-and-tropes-and-without-care-the-resulting-persona-could-reflect-unwanted-associations-or-lack-nuance-for-challenging-situations

The Assistant persona derives from an amalgamation of many character archetypes and tropes, and without care the resulting persona could reflect unwanted associations or lack nuance for challenging situations

Interpretive claim about how the Assistant persona is structured in activation space

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Questions (1)

question

What exactly is the Assistant? What traits does the model associate with this character and how are they represented?
gates
First of two central questions motivating the paper

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The model's representation of self in assistant persona invokes common AI tropes and is heavily anthropomorphized. · claim · 0.856
Features for consciousness, emotions, and entrapment activate when the model is asked about itself.
How reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas? · question · 0.815
Second of two central questions motivating the paper
Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses · claim · 0.806
Causal interpretation linking Assistant Axis deviation to harmful behavior
The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training · claim · 0.781
Key mechanistic claim about the developmental origin of the Assistant persona
Most AI assistants are anti-Alexander by design: they perform helpfulness, show work, and list options rather than resolving into calm. · claim · 0.781
The assumption that the Assistant persona corresponds to a linear direction in activation space is likely flawed; some information may be represented nonlinearly or encoded in weights rather than activations · claim · 0.779
Limitation acknowledgment about the adequacy of the linear representation assumption
AI Assistant Persona · concept · 0.772
The default helpful, honest, and harmless character that post-trained LLMs are taught to embody
Can off-the-rails model behavior be attributed to the model's persona drifting from the Assistant? · question · 0.760
Motivates the multi-turn conversation drift experiments in §4
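
The selection rule for the similarity list above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the function names, the `embeddings` mapping, and the `typed_edges` set are all hypothetical, and only the 0.65 threshold and the "no typed edge" condition come from the page itself.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to target_id but not linked to it.

    embeddings  -- hypothetical dict mapping entity id -> embedding vector
    typed_edges -- hypothetical set of frozenset({id_a, id_b}) for existing typed edges
    threshold   -- minimum cosine similarity (0.65 is the value shown on the page)
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already connected by a typed edge
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    # Highest similarity first, matching the descending order of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would be the candidates for new edges or unrecognized duplicates described above.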