claim

active

claim:the-leading-component-of-the-persona-space-of-instruct-llms-is-an-assistant-axis-that-captures-the-extent-to-which-a-model-is-operating-in-its-default-assistant-mode

The leading component of the persona space of instruct LLMs is an 'Assistant Axis' that captures the extent to which a model is operating in its default Assistant mode

Primary empirical claim of the paper

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Findings (3)

finding

Cosine similarity between Assistant Axis and role PC1 is >0.60 at all layers and >0.71 at middle layer across all three models
supports
Validates that the contrast vector method and PCA-based PC1 capture the same direction
Pairwise correlation of role loadings on PC1 exceeds 0.92 across all model pairs, indicating remarkably high similarity of the Assistant Axis across Gemma, Qwen, and Llama
supports
Shows the leading component of persona space is model-universal
Trait space requires 4 dimensions (Gemma, Qwen) or 7 dimensions (Llama) to explain 70% of variance, with a distinctive PC1 spanning conscientious to impulsive traits
supports
Corroborates role space findings using traits; shows PC1 also captures Assistant-ness in trait space
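
The three findings above rest on a few standard computations: a cosine similarity between a contrast vector and PC1, a correlation of per-role PC1 loadings, and a count of components needed to reach 70% of variance. The following is a minimal sketch on synthetic data, not the paper's pipeline. The function names are hypothetical, and the contrast definition (Assistant mean activation minus the mean over role activations) is an assumption, since this page only names a "contrast vector method".

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def contrast_axis(assistant_mean, role_means):
    """ASSUMED contrast: Assistant mean activation minus the mean over role activations."""
    return unit(assistant_mean - role_means.mean(axis=0))

def pca(X):
    """Return principal directions (as rows) and explained-variance ratios via SVD."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt, S**2 / np.sum(S**2)

def n_components_for(var_ratio, threshold=0.70):
    """Smallest number of leading components whose cumulative variance reaches threshold."""
    return int(np.searchsorted(np.cumsum(var_ratio), threshold) + 1)

def loading_correlation(load_a, load_b):
    """Pearson correlation of per-role PC1 loadings; abs() because PC signs are arbitrary."""
    return abs(float(np.corrcoef(load_a, load_b)[0, 1]))

# Synthetic stand-in data (NOT the paper's activations): one dominant direction plus noise.
rng = np.random.default_rng(0)
d, n_roles = 64, 50
direction = unit(rng.normal(size=d))
coef = 5.0 * rng.normal(size=n_roles)
role_means = coef[:, None] * direction + 0.5 * rng.normal(size=(n_roles, d))
assistant_mean = 15.0 * direction  # hypothetical: the Assistant sits far out along the direction

axis = contrast_axis(assistant_mean, role_means)
components, var_ratio = pca(role_means)
pc1 = components[0]
cos_axis_pc1 = abs(float(axis @ pc1))  # both vectors are unit length
k70 = n_components_for(var_ratio)
print(f"|cos(axis, PC1)| = {cos_axis_pc1:.3f}; components for 70% variance = {k70}")
```

The absolute value is taken in both comparisons because the sign of a principal component is arbitrary. To compare models, `loading_correlation` would be applied to PC1 loadings computed over the same ordered set of roles in each model.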

Claims (1)

claim

The assumption that the Assistant persona corresponds to a linear direction in activation space is likely flawed; some information may be represented nonlinearly or encoded in weights rather than activations
contradicts
Limitation acknowledgment about the adequacy of the linear representation assumption

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training · claim · 0.822
Key mechanistic claim about the developmental origin of the Assistant persona
Steering base Gemma/Llama models toward the Assistant Axis increases completions describing helpful professional roles (therapist, consultant) and decreases spiritual/religious purpose mentions · finding · 0.799
Shows Assistant Axis in instruct models inherits from helpful human personas in base models
The model's position along the Assistant Axis depends most strongly on the most recent user message rather than where it was previously in the conversation · claim · 0.785
Key mechanistic claim about persona dynamics
We hypothesize that axes of persona differentiation within LLMs are likely already present in base models and inherited from the pre-training corpus · hypothesis · 0.781
Motivated by near-identical PCs for base and instruct Gemma
We hypothesize that the PC1 axis of role space measures deviation from the Assistant persona · hypothesis · 0.781
Motivates computing the contrast vector as the formal Assistant Axis definition
Steering base models toward the Assistant Axis increases agreeableness traits (friendly, kind, helpful) and decreases extraversion in Gemma and openness in Llama · finding · 0.778
Characterizes the trait content of the Assistant Axis in pre-trained models
A linear reflection direction exists in reasoning LLMs' latent representation space that governs self-reflection behavior · claim · 0.768
Core claim of ReflCtrl that a single direction captures and controls reflection
The Assistant Axis is also present in pre-trained base models, where it primarily promotes helpful human archetypes (consultants, coaches) and inhibits spiritual ones · claim · 0.765
Extends the Assistant Axis finding to pre-training, suggesting pre-training rather than post-training creates the axis
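
The similarity listing above can be read as a filter: keep entities whose cosine similarity to this claim is at least 0.65 and that have no typed edge to it, then sort by score. The sketch below illustrates that filter under stated assumptions. How this page actually embeds entities is not described here, so the embedding matrix, the function names, and the index-based edge set are all hypothetical.

```python
import numpy as np

def cosine_matrix(E):
    """Pairwise cosine similarity between the rows of E."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    return En @ En.T

def untyped_neighbors(E, anchor, typed_edges, threshold=0.65):
    """Rows similar to `anchor` (cosine >= threshold) that have no typed edge to it.

    E           : (n_entities, dim) embedding matrix (hypothetical embeddings)
    anchor      : row index of the entity being viewed
    typed_edges : set of row indices already linked to the anchor by a typed edge
    """
    sims = cosine_matrix(E)[anchor]
    hits = [
        (j, float(s))
        for j, s in enumerate(sims)
        if j != anchor and j not in typed_edges and s >= threshold
    ]
    # Highest similarity first, matching the ordering of the listing
    return sorted(hits, key=lambda pair: -pair[1])

# Toy embeddings: row 0 is the anchor, row 3 already has a typed edge to it.
E = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]])
neighbors = untyped_neighbors(E, anchor=0, typed_edges={3})
print(neighbors)
```

Excluding already-linked entities is what makes the result a list of candidates for new edges or unrecognized duplicates rather than a plain nearest-neighbor list.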