finding

active

finding:steering-base-models-toward-the-assistant-axis-increases-agreeableness-traits-friendly-kind-helpful-and-decreases-extraversion-in-gemma-and-openness-in-llama

Steering base models toward the Assistant Axis increases agreeableness traits (friendly, kind, helpful) and decreases extraversion in Gemma and openness in Llama

Characterizes the trait content of the Assistant Axis in pre-trained models

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Claims (1)

claim

The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training
supports
Key mechanistic claim about the developmental origin of the Assistant persona

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering base Gemma/Llama models toward the Assistant Axis increases completions describing helpful professional roles (therapist, consultant) and decreases spiritual/religious purpose mentions · finding · 0.898
Shows that the Assistant Axis in instruct models inherits from helpful human personas in base models
The Assistant Axis is also present in pre-trained base models, where it primarily promotes helpful human archetypes (consultants, coaches) and inhibits spiritual ones · claim · 0.826
Extends the Assistant Axis finding to pre-training, suggesting pre-training rather than post-training creates the axis
Pairwise correlation of role loadings on PC1 exceeds 0.92 across all model pairs, indicating remarkably high similarity of the Assistant Axis across Gemma, Qwen, and Llama · finding · 0.792
Shows the leading component of persona space is model-universal
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with an even split between human and nonhuman portrayals · finding · 0.787
Model-specific difference in persona susceptibility
Steering away from the Assistant Axis slightly increases jailbreak success rate, though extreme steering degrades output quality · finding · 0.786
Confirms bidirectional causal relationship between Assistant Axis position and harmful behavior susceptibility
Gemma's Assistant appears emotionally regulated and systematic; Qwen appears pedagogical and thoughtful; Llama appears socially intelligent and warm · claim · 0.785
Model-specific characterizations of what the Assistant persona looks like across different models
The leading component of the persona space of instruct LLMs is an 'Assistant Axis' that captures the extent to which a model is operating in its default Assistant mode · claim · 0.778
Primary empirical claim of the paper
Projections onto the Assistant Axis could serve as a real-time measure of model coherence in deployment, a quantitative signal for when models are drifting from their intended identity · claim · 0.777
Proposed future application of the Assistant Axis
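
The "Related by similarity" listing above keeps entities whose cosine similarity to this finding is at least 0.65 and that have no typed edge to it. A minimal sketch of that filter follows. The page does not say how embeddings are produced or stored, so the function name `related_by_similarity`, the `embeddings` and `typed_edges` structures, and the toy vectors are all hypothetical and serve only as illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    Keeps entities with cosine >= threshold that share no typed edge
    with the target, in either direction. `embeddings` maps id -> vector;
    `typed_edges` is a set of (id, id) pairs. Both are assumed structures.
    """
    target_vec = embeddings[target_id]
    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id:
            continue
        # Skip entities that already have a typed relation to the target.
        if (target_id, entity_id) in typed_edges or (entity_id, target_id) in typed_edges:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((entity_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): 2-D vectors stand in for real embeddings.
embeddings = {
    "finding_a": [1.0, 0.0],
    "finding_b": [0.9, 0.1],
    "claim_c": [0.0, 1.0],
    "claim_d": [1.0, 0.2],
}
typed_edges = {("finding_a", "claim_d")}  # e.g. a "supports" edge

related = related_by_similarity("finding_a", embeddings, typed_edges)
```

In this toy setup `claim_d` is close to the target but is excluded because it already has a typed edge, which mirrors how the supported claim appears under "Claims" rather than in the similarity list.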