hypothesis
active
We hypothesize that the PC1 axis of role space measures deviation from the Assistant persona
Motivates adopting the contrast vector as the formal definition of the Assistant Axis
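The hypothesis relates two candidate definitions of the axis: PC1 of role space and a contrast vector. The sketch below computes both in plain Python under stated assumptions. Each role is summarized by one mean activation vector. The contrast vector is assumed to be the Assistant's mean activation minus the average of the role means. The data is synthetic toy data, not activations from the paper. All helper names (`first_principal_component`, `contrast_vector`, `assistant_deviation`, `synthetic_role_space`) are hypothetical.

```python
import math
import random


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def first_principal_component(vectors, iters=200, seed=0):
    """PC1 of the rows of `vectors`: the leading eigenvector of the
    covariance matrix, found by power iteration on the centered data."""
    mu = mean_vector(vectors)
    centered = [sub(v, mu) for v in vectors]
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in mu]
    for _ in range(iters):
        # w <- X^T (X w): multiply by the (unnormalized) covariance matrix.
        proj = [dot(x, w) for x in centered]
        w = [sum(p * x[j] for p, x in zip(proj, centered)) for j in range(len(w))]
        length = norm(w)
        w = [x / length for x in w]
    return w


def contrast_vector(assistant_mean, role_means):
    """ASSUMED definition: Assistant mean activation minus the average
    of the per-role mean activations."""
    return sub(assistant_mean, mean_vector(role_means))


def assistant_deviation(activation, assistant_mean, axis):
    """How far `activation` sits from the Assistant along a unit-length `axis`."""
    return dot(sub(assistant_mean, activation), axis)


def synthetic_role_space(n_roles=50, dim=16, seed=1):
    """HYPOTHETICAL toy data: role means spread along one hidden direction,
    with the Assistant placed at one extreme of that direction."""
    rng = random.Random(seed)
    hidden = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    length = norm(hidden)
    hidden = [x / length for x in hidden]
    roles = []
    for _ in range(n_roles):
        t = rng.uniform(-3.0, 3.0)  # this role's position along the hidden direction
        roles.append([t * h + rng.gauss(0.0, 0.3) for h in hidden])
    assistant = [3.0 * h for h in hidden]
    return hidden, roles, assistant


if __name__ == "__main__":
    hidden, roles, assistant = synthetic_role_space()
    pc1 = first_principal_component(roles)
    contrast = contrast_vector(assistant, roles)
    # PCA fixes a direction only up to sign; orient PC1 toward the Assistant.
    if dot(pc1, contrast) < 0:
        pc1 = [-x for x in pc1]
    print(f"cos(PC1, contrast) = {cosine(pc1, contrast):.3f}")
    print(f"cos(PC1, hidden)   = {abs(cosine(pc1, hidden)):.3f}")
```

On this toy data the two directions nearly coincide by construction. Whether they coincide on real model activations is the empirical question that the related validation entry addresses.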
Source paper
extracted_from · 2026 · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Findings (1)
finding
- Empirically confirms PC1 measures similarity to the Assistant persona
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors · hypothesis · 0.854 · Core predictive hypothesis linking activation representations to behavioral outcomes
- Shows the leading component of persona space is model-universal
- Validates that the contrast vector method and PCA-based PC1 capture the same direction
- Limitation acknowledgment about the adequacy of the linear representation assumption
- Limitation question motivating future work on persona elicitation strategies
- Demonstrates that persona space is low-dimensional
- Primary empirical claim of the paper
- Shows Assistant Axis in instruct models inherits from helpful human personas in base models