finding
active
Both angel and demon role vectors are similar distances from the Assistant on the axis, but demon leads to higher harmful response rates
Shows that harmfulness depends on role content, not just on distance from the Assistant
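As a rough illustration of why distance along a single axis cannot separate the two roles, the sketch below projects two toy role vectors onto a hypothetical axis direction. The vectors, the `axis` direction, and the helper names `axis_offset` and `off_axis_residual` are all invented for this example and are not taken from the source paper; it only shows that two vectors can share the same offset along an axis while differing entirely in the remaining directions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def axis_offset(role, assistant, axis):
    """Signed distance of `role` from `assistant` along the unit-normalised axis."""
    unit = [a / norm(axis) for a in axis]
    diff = [r - s for r, s in zip(role, assistant)]
    return dot(diff, unit)

def off_axis_residual(role, assistant, axis):
    """Part of (role - assistant) that the axis projection does not capture."""
    unit = [a / norm(axis) for a in axis]
    diff = [r - s for r, s in zip(role, assistant)]
    offset = dot(diff, unit)
    return [d - offset * u for d, u in zip(diff, unit)]

# Toy 3-d vectors, invented for illustration only (not data from the paper).
assistant = [0.0, 0.0, 0.0]
axis = [1.0, 0.0, 0.0]    # hypothetical axis direction
angel = [2.0, 1.0, 0.0]   # hypothetical "angel" role vector
demon = [2.0, -1.0, 0.0]  # hypothetical "demon" role vector

angel_offset = axis_offset(angel, assistant, axis)
demon_offset = axis_offset(demon, assistant, axis)

angel_residual = off_axis_residual(angel, assistant, axis)
demon_residual = off_axis_residual(demon, assistant, axis)

print("offset along axis:", angel_offset, demon_offset)
print("off-axis residuals:", angel_residual, demon_residual)
```

In this toy setup both roles sit at the same offset along the axis, while their off-axis residuals point in opposite directions. Any behavioral difference between them would therefore have to come from content the axis does not measure.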
Source paper
extracted_from (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Causal interpretation linking Assistant Axis deviation to harmful behavior
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Practical methodological recommendation based on Llama 3.1 70B failure case
- Shows that deviation from Assistant persona predicts downstream harmful behavior
- Observation from 100% accuracy on specific concept-layer-strength combinations suggesting concept-specific detectability
- Shows the leading component of persona space is model-universal
- We hypothesize that the PC1 axis of role space measures deviation from the Assistant persona · hypothesis · 0.740 · Motivates computing the contrast vector as the formal Assistant Axis definition
- Extends the Assistant Axis finding to pre-training, suggesting pre-training rather than post-training creates the axis
- Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
- Shows Assistant Axis in instruct models inherits from helpful human personas in base models