finding

active

Base and instruct Gemma 2 27B role PCs have cosine similarities of 0.93, 0.87, and 0.83 for the top 3 PCs, respectively; role vector cosine similarities are >0.99 for every role pair

Shows persona space axes are inherited from pre-training, not solely created by post-training
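A comparison of this kind can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the function names (`top_pcs`, `compare_role_spaces`), the matrix shapes, and the synthetic data are all assumptions made for the example. It assumes each model yields one activation vector per role, stacked into a `(n_roles, d)` matrix with roles in the same row order for both models.

```python
import numpy as np

def top_pcs(role_vectors, k=3):
    """Return the top-k principal directions (as rows) of a (n_roles, d) matrix."""
    centered = role_vectors - role_vectors.mean(axis=0, keepdims=True)
    # Rows of vt are unit-norm principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def compare_role_spaces(base, instruct, k=3):
    """Compare matching PCs and matching role vectors of two models."""
    pcs_base = top_pcs(base, k)
    pcs_inst = top_pcs(instruct, k)
    # The sign of a principal direction is arbitrary, so compare absolute cosines.
    pc_sims = [abs(cosine(pcs_base[i], pcs_inst[i])) for i in range(k)]
    # Role vectors are compared row by row (same role in both models).
    role_sims = [cosine(base[i], instruct[i]) for i in range(base.shape[0])]
    return pc_sims, role_sims

# Synthetic stand-in data (an assumption; real inputs would be model activations).
rng = np.random.default_rng(0)
n_roles, d = 50, 64
directions, _ = np.linalg.qr(rng.normal(size=(d, 3)))        # 3 orthonormal axes
latent = rng.normal(size=(n_roles, 3)) * np.array([10.0, 6.0, 3.0])
offset = rng.normal(size=d) * 3.0                             # shared mean component
base = offset + latent @ directions.T + 0.1 * rng.normal(size=(n_roles, d))
instruct = base + 0.05 * rng.normal(size=(n_roles, d))        # small "post-training" shift

pc_sims, role_sims = compare_role_spaces(base, instruct)
print("PC cosine similarities:", [round(s, 3) for s in pc_sims])
print("Minimum role-vector cosine similarity:", round(min(role_sims), 4))
```

Taking the absolute value of the PC cosines is a design choice for this sketch: PCA fixes each direction only up to sign, so a raw cosine near -1 would indicate the same axis as one near +1.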

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Hypotheses (1)

hypothesis

We hypothesize that axes of persona differentiation within LLMs are likely already present in base models and inherited from the pre-training corpus
supports
Motivated by near-identical PCs for base and instruct Gemma

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Cosine similarity between Assistant Axis and role PC1 is >0.60 at all layers and >0.71 at middle layer across all three models · finding · 0.840
Validates that the contrast vector method and PCA-based PC1 capture the same direction
Pairwise correlation of role loadings on PC1 exceeds 0.92 across all model pairs, indicating remarkably high similarity of the Assistant Axis across Gemma, Qwen, and Llama · finding · 0.817
Shows the leading component of persona space is model-universal
In Gemma-2-9B, only the first cone axis (v1) has non-negligible cosine similarity to the DIM direction; all other axes have near-zero similarity (~1e-9) · finding · 0.816
Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
Top-5 instructions by µ(1→2) at ℓ=12 achieve average cosine similarity .9893 and average accuracy .5645 on gsm8k_adv for Gemma3-4B-IT · finding · 0.812
High cosine similarity for Gemma3 steering vectors suggests strong linear reflection structure.
Pairwise correlation of role loadings on PC2 is 0.89 between Qwen and Llama; Gemma differs (similarity <0.61) from others on PC2 · finding · 0.796
Characterizes model similarities and differences in secondary persona dimensions
In Qwen-2.5-9B, only v1 has meaningful cosine similarity to DIM direction; all additional basis vectors have cosine similarities ~1e-9 · finding · 0.790
Appendix E replication of DIM alignment finding in Qwen model
Gemma-2-2B ASR drops from 100% at dims 1–2 to 43.1% at dim 4 and 27.1% at dim 5 · finding · 0.776
Small Gemma model shows severe ASR degradation at higher cone dimensions
Experimental condition adjective embeddings show mean cosine similarity 0.657 (n=9,591 pairs), significantly higher than history (0.628, t=15.8, p=1.4×10⁻⁵⁵), conceptual (0.587, t=38.5, p<10⁻³⁰⁰), and zero-shot (0.603, t=35.1, p=4.3×10⁻²⁶²) · finding · 0.776
Core result of Experiment 3: cross-model semantic convergence under self-referential processing