finding
active
Cosine similarity between Assistant Axis and role PC1 is >0.60 at all layers and >0.71 at the middle layer across all three models
Validates that the contrast vector method and PCA-based PC1 capture the same direction
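A minimal sketch of the comparison this finding describes, run on synthetic data rather than real model activations. It assumes the contrast vector is the mean Assistant activation minus the mean role activation, and that PC1 is the leading principal component of the role activations. Both definitions, the function names, and the synthetic setup are illustrative assumptions and are not taken from the source paper.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def first_principal_component(X):
    """Unit-norm leading principal direction of the rows of X."""
    centered = X - X.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def contrast_vector(assistant_acts, role_acts):
    """Assumed definition: mean Assistant activation minus mean role activation."""
    return assistant_acts.mean(axis=0) - role_acts.mean(axis=0)


# Synthetic stand-in for one layer: roles vary mostly along one hidden axis.
rng = np.random.default_rng(0)
dim = 64
true_axis = rng.normal(size=dim)
true_axis /= np.linalg.norm(true_axis)

role_coords = rng.normal(loc=-3.0, scale=2.0, size=200)
role_acts = role_coords[:, None] * true_axis + 0.3 * rng.normal(size=(200, dim))
assistant_acts = 3.0 * true_axis + 0.3 * rng.normal(size=(50, dim))

axis = contrast_vector(assistant_acts, role_acts)
pc1 = first_principal_component(role_acts)

# The sign of a principal component is arbitrary, so compare magnitudes.
sim = abs(cosine_similarity(axis, pc1))
print(f"|cos(contrast vector, PC1)| = {sim:.3f}")
```

Repeating this per layer and per model would give the kind of layer-wise similarity profile the finding summarizes. The absolute value is a design choice here, since PCA does not fix the sign of a component.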
Source paper
extracted_from (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Primary empirical claim of the paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows the leading component of persona space is model-universal
- Shows persona space axes are inherited from pre-training, not solely created by post-training
- We hypothesize that the PC1 axis of role space measures deviation from the Assistant persona · hypothesis · 0.812 · Motivates computing the contrast vector as the formal Assistant Axis definition
- Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
- Practical methodological recommendation based on Llama 3.1 70B failure case
- Shows that deviation from Assistant persona predicts downstream harmful behavior
- Appendix E replication of DIM alignment finding in Qwen model
- Proposed future application of the Assistant Axis