finding

active

finding:pairwise-similarity-of-trait-pc1-across-all-three-models-is-0-81-no-pairwise-correlation-in-top-3-trait-pcs-is-below-0-70

Pairwise similarity of trait PC1 across all three models is >0.81; no pairwise correlation in top 3 trait PCs is below 0.70

Shows that, beyond PC1, trait space is more consistent across models than role space
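As a rough illustration of the kind of comparison this finding summarizes, the sketch below correlates per-trait loadings on the top 3 PCs across every pair of models. It is a minimal sketch under stated assumptions, not the paper's procedure: the data are synthetic, the model names (`model_a`, `model_b`, `model_c`), the helper names, and the hidden sizes are hypothetical, and taking the absolute value of the correlation to handle the arbitrary sign of a principal component is an assumption made here.

```python
import itertools

import numpy as np


def pc_loadings(vectors, k=3):
    """Project each trait vector onto the top-k principal components.

    vectors: array of shape (n_traits, hidden_dim), one row per trait.
    Returns an (n_traits, k) array of per-trait loadings (PC scores).
    """
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]


def pairwise_pc_correlations(loadings_by_model, k=3):
    """Absolute Pearson correlation of loadings on each PC, per model pair."""
    results = {}
    pairs = itertools.combinations(loadings_by_model.items(), 2)
    for (name_a, load_a), (name_b, load_b) in pairs:
        results[(name_a, name_b)] = [
            # abs() because the sign of a principal component is arbitrary
            abs(np.corrcoef(load_a[:, i], load_b[:, i])[0, 1])
            for i in range(k)
        ]
    return results


# Synthetic stand-in data (assumption): the same 200 traits share a
# 3-dimensional latent structure that each hypothetical model embeds
# in its own hidden space, plus a little noise.
rng = np.random.default_rng(0)
n_traits = 200
latent = rng.standard_normal((n_traits, 3)) * np.array([5.0, 3.0, 2.0])

loadings_by_model = {}
for name, hidden_dim in [("model_a", 64), ("model_b", 96), ("model_c", 128)]:
    # Orthonormal embedding of the latent space into this model's hidden space
    basis, _ = np.linalg.qr(rng.standard_normal((hidden_dim, 3)))
    noise = 0.05 * rng.standard_normal((n_traits, hidden_dim))
    loadings_by_model[name] = pc_loadings(latent @ basis.T + noise)

correlations = pairwise_pc_correlations(loadings_by_model)
for pair, values in correlations.items():
    print(pair, np.round(values, 3))
```

Loadings are compared rather than the PC directions themselves because the hidden dimension differs between models, whereas the set of traits is shared and gives a common index for the correlation.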

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Pairwise correlation of role loadings on PC1 exceeds 0.92 across all model pairs, indicating remarkably high similarity of the Assistant Axis across Gemma, Qwen, and Llama · finding · 0.818
Shows the leading component of persona space is model-universal
Cross-model pairwise cosine similarity of zero-shot control responses = 0.603 (n=12,720 pairs, t=35.1, p=4.3×10⁻²⁶² vs. experimental) · finding · 0.764
Experiment 3 comparison: zero-shot control shows lower semantic convergence than experimental condition
82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance · finding · 0.760
SAE features are not simply mirroring individual neurons.
Cosine similarity between Assistant Axis and role PC1 is >0.60 at all layers and >0.71 at middle layer across all three models · finding · 0.757
Validates that the contrast vector method and PCA-based PC1 capture the same direction
Base and instruct Gemma 2 27B role PCs have cosine similarities of 0.93, 0.87, 0.83 for the top 3 PCs respectively; role vector cosine similarities >0.99 for every role pair · finding · 0.756
Shows persona space axes are inherited from pre-training, not solely created by post-training
Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.756
Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude
Strength comparison pair (3,7) with |Δα|=4 outperforms pair (3,5) with |Δα|=2, indicating graded sensitivity to perturbation magnitude · finding · 0.754
Shows that introspective accuracy scales with injection strength difference, not binary detection
Trait space requires 4 dimensions (Gemma, Qwen) and 7 dimensions (Llama) to explain 70% of variance, with distinctive PC1 spanning conscientious to impulsive traits · finding · 0.749
Corroborates role space findings using traits; shows PC1 also captures Assistant-ness in trait space