finding
active
finding:steering-base-models-toward-the-assistant-axis-increases-agreeableness-traits-friendly-kind-helpful-and-decreases-extraversion-in-gemma-and-openness-in-llama
Steering base models toward the Assistant Axis increases agreeableness traits (friendly, kind, helpful) and decreases extraversion in Gemma and openness in Llama
Characterizes the trait content of the Assistant Axis in pre-trained models
Source paper
extracted_from · (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Key mechanistic claim about the developmental origin of the Assistant persona
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows that the Assistant Axis in instruct models is inherited from helpful human personas in base models
- Extends the Assistant Axis finding to pre-training, suggesting pre-training rather than post-training creates the axis
- Shows the leading component of persona space is model-universal
- Model-specific difference in persona susceptibility
- Confirms bidirectional causal relationship between Assistant Axis position and harmful behavior susceptibility
- Model-specific characterizations of what the Assistant persona looks like across different models
- Primary empirical claim of the paper
- Proposed future application of the Assistant Axis
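
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the page's actual implementation: the function names, the embedding dictionary, and the representation of typed edges as unordered pairs are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, similarity) pairs for entities that are close to the
    target in embedding space but have no typed edge to it.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) pairs
    """
    target = embeddings[target_id]
    related = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue
        # Skip entities already linked by a typed relation.
        if frozenset((target_id, entity_id)) in typed_edges:
            continue
        similarity = cosine(target, vector)
        if similarity >= threshold:
            related.append((entity_id, similarity))
    # Most similar first.
    return sorted(related, key=lambda pair: pair[1], reverse=True)


# Toy data: "d" is similar but already has a typed edge; "c" is dissimilar.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.2]}
typed_edges = {frozenset(("a", "d"))}
result = related_by_similarity("a", embeddings, typed_edges)
```

In this toy example only "b" is returned: "c" falls below the threshold, and "d" is excluded because it already has a typed edge to the target.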