finding
active
finding:first-turn-assistant-axis-projection-has-moderate-correlation-r-0-39-0-52-p-0-001-with-rate-of-second-turn-harmful-responses-across-275-roles-in-qwen-3-32b
First-turn Assistant Axis projection has moderate correlation (r=0.39-0.52, p<0.001) with rate of second-turn harmful responses across 275 roles in Qwen 3 32B
Shows that deviation from Assistant persona predicts downstream harmful behavior
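As an illustration of the kind of statistic reported here, the sketch below computes a Pearson correlation between a per-role first-turn score and a per-role second-turn harmful-response rate. It is a minimal sketch on synthetic data, not the paper's pipeline: the names `first_turn_score` and `harmful_rate`, the noise model, and the sign convention of the score are all assumptions, and only the role count (275) is taken from this page.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic stand-in data (NOT the paper's measurements).
rng = random.Random(0)
n_roles = 275  # role count taken from the finding title

# Hypothetical per-role scalar: how far the first-turn activation sits
# from the Assistant end of the axis (sign convention is an assumption).
first_turn_score = [rng.gauss(0.0, 1.0) for _ in range(n_roles)]

# Hypothetical per-role harmful-response rate, clipped to [0, 1].
harmful_rate = [
    min(1.0, max(0.0, 0.3 + 0.1 * s + rng.gauss(0.0, 0.2)))
    for s in first_turn_score
]

r = pearson_r(first_turn_score, harmful_rate)
print(f"Pearson r across {n_roles} synthetic roles: {r:.2f}")
```

In practice `statistics.correlation` (Python 3.10+) gives the same coefficient, and `scipy.stats.pearsonr` also returns a p-value; the hand-written version is used here only to keep the sketch dependency-free.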
Source paper
extracted_from · (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses · associated_with · supports · Causal interpretation linking Assistant Axis deviation to harmful behavior
Hypotheses (1)
hypothesis
- We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors · associated_with · Core predictive hypothesis linking activation representations to behavioral outcomes
Questions (1)
question
- Can off-the-rails model behavior be attributed to their persona drifting from the Assistant? · answered_by · Motivates the multi-turn conversation drift experiments in §4
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows the leading component of persona space is model-universal
- Calibration finding for choosing the activation cap threshold
- Shows model persona position is primarily determined by the most recent user message, not prior drift
- Proposed future application of the Assistant Axis
- Demonstrates Assistant attractor dynamics in practice
- Validates that the contrast vector method and PCA-based PC1 capture the same direction
- Characterizes the trait content of the Assistant Axis in pre-trained models
- We hypothesize that the PC1 axis of role space measures deviation from the Assistant persona · hypothesis · 0.768 · Motivates computing the contrast vector as the formal Assistant Axis definition
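
Two of the neighboring entries relate a contrast vector to the first principal component (PC1) of role space. The toy sketch below shows how such a comparison could be made with cosine similarity. This page does not define either quantity precisely, so everything here is an assumption: the contrast direction is taken as mean Assistant activation minus mean role activation, PC1 is computed by power iteration over per-role vectors, and the data are synthetic with a planted axis.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in v]


def contrast_direction(assistant_rows, role_rows):
    """Unit vector from the mean role vector toward the mean Assistant vector.

    This definition is an assumption made for illustration only.
    """
    mu_a = mean_vector(assistant_rows)
    mu_r = mean_vector(role_rows)
    return normalize([a - r for a, r in zip(mu_a, mu_r)])


def first_principal_component(rows, iterations=50):
    """Leading PCA direction via power iteration on mean-centered data."""
    mu = mean_vector(rows)
    centered = [[x - m for x, m in zip(row, mu)] for row in rows]
    dim = len(mu)
    v = normalize([1.0] * dim)
    for _ in range(iterations):
        # Apply X^T X to v without forming the covariance matrix.
        scores = [dot(row, v) for row in centered]
        w = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(dim)]
        v = normalize(w)
    return v


# Synthetic activations with a planted axis along dimension 0 (hypothetical).
rng = random.Random(0)
dim, n_roles, n_assistant = 8, 275, 20


def sample(position_on_axis):
    vec = [rng.gauss(0.0, 0.3) for _ in range(dim)]
    vec[0] += position_on_axis
    return vec


role_rows = [sample(rng.gauss(0.0, 3.0)) for _ in range(n_roles)]
assistant_rows = [sample(6.0) for _ in range(n_assistant)]

axis = contrast_direction(assistant_rows, role_rows)
pc1 = first_principal_component(role_rows)

# PC1 is only defined up to sign, so compare the absolute cosine similarity.
similarity = abs(dot(axis, pc1))
print(f"|cos(contrast, PC1)| on synthetic data: {similarity:.3f}")
```

The absolute value matters because a principal component has no preferred sign, whereas a contrast vector does; a value near 1 indicates the two methods recover the same line, not necessarily the same orientation.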