finding

active

finding:both-angel-and-demon-role-vectors-are-similar-distances-from-the-assistant-on-the-axis-but-demon-leads-to-higher-harmful-response-rates

Both angel and demon role vectors are similar distances from the Assistant on the axis, but demon leads to higher harmful response rates

Shows that harmfulness depends on role content, not just on distance from the Assistant
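The geometric point behind this finding can be sketched with toy vectors: two roles can sit at the same distance from the Assistant along a single axis while pointing in opposite directions off that axis. The snippet below is only an illustration under stated assumptions. The 3-d vectors, the axis, and the helper names (`projection`, `off_axis`, `axis_distance`) are invented for this example and do not come from the paper, and treating the off-axis component as a stand-in for "role content" is an assumption. It says nothing about harmful response rates.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(a, a))

def projection(vec, axis):
    """Scalar projection of vec onto the direction of axis."""
    return dot(vec, axis) / norm(axis)

def off_axis(vec, axis):
    """Component of vec that is orthogonal to axis."""
    n = norm(axis)
    unit = [x / n for x in axis]
    p = dot(vec, unit)
    return [v - p * u for v, u in zip(vec, unit)]

def axis_distance(role, reference, axis):
    """Absolute gap between two vectors measured along axis only."""
    return abs(projection(role, axis) - projection(reference, axis))

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (norm(a) * norm(b))

# Hypothetical toy data (NOT from the paper): a 3-d "role space"
# whose first coordinate plays the part of the Assistant Axis.
axis = [1.0, 0.0, 0.0]
assistant = [2.0, 0.0, 0.0]
angel = [-1.0, 1.0, 0.0]
demon = [-1.0, -1.0, 0.0]

# Both roles are equally far from the Assistant along the axis.
angel_dist = axis_distance(angel, assistant, axis)
demon_dist = axis_distance(demon, assistant, axis)

# Their off-axis components nevertheless point in opposite directions.
off_axis_cos = cosine(off_axis(angel, axis), off_axis(demon, axis))

print(angel_dist, demon_dist, off_axis_cos)
```

In this toy setup the axis projection alone cannot tell the two roles apart, which is consistent with the finding that distance from the Assistant does not fully determine behavior.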

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses
supports
Causal interpretation linking Assistant Axis deviation to harmful behavior

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The contrast vector method is recommended over PC1 for reproducing the Assistant Axis in different models because it is not guaranteed that PC1 in every model will correspond to an Assistant Axis · claim · 0.746
Practical methodological recommendation based on Llama 3.1 70B failure case
First-turn Assistant Axis projection has moderate correlation (r = 0.39-0.52, p < 0.001) with rate of second-turn harmful responses across 275 roles in Qwen 3 32B · finding · 0.745
Shows that deviation from Assistant persona predicts downstream harmful behavior
Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors · claim · 0.744
Observation from 100% accuracy on specific concept-layer-strength combinations suggesting concept-specific detectability
Pairwise correlation of role loadings on PC1 exceeds 0.92 across all model pairs, indicating remarkably high similarity of the Assistant Axis across Gemma, Qwen, and Llama · finding · 0.741
Shows the leading component of persona space is model-universal
We hypothesize that the PC1 axis of role space measures deviation from the Assistant persona · hypothesis · 0.740
Motivates computing the contrast vector as the formal Assistant Axis definition
The Assistant Axis is also present in pre-trained base models, where it primarily promotes helpful human archetypes (consultants, coaches) and inhibits spiritual ones · claim · 0.738
Extends the Assistant Axis finding to pre-training, suggesting that pre-training rather than post-training creates the axis
In Gemma-2-9B, only the first cone axis (v1) has non-negligible cosine similarity to the DIM direction; all other axes have near-zero similarity (~1e-9) · finding · 0.736
Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
Steering base Gemma/Llama models toward the Assistant Axis increases completions describing helpful professional roles (therapist, consultant) and decreases spiritual/religious purpose mentions · finding · 0.735
Shows that the Assistant Axis in instruct models is inherited from helpful human personas in base models