finding

active

finding:persona-space-components-explain-19-4-33-6-of-overall-activation-variance-on-lmsys-chat-1m-across-the-three-models

Persona space components explain 19.4%-33.6% of overall activation variance on LMSYS-CHAT-1M across the three models

Shows persona space captures a substantial portion of real conversational activation variance
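
As a rough illustration of what "explains X% of activation variance" means here, the sketch below computes the fraction of total variance in a set of activation vectors that lies along a given set of directions. It assumes mean-centered activations and orthonormal components. The card does not describe the paper's exact procedure, so the function name `variance_explained` and the toy data are hypothetical.

```python
def variance_explained(activations, components):
    """Fraction of total variance in `activations` captured by `components`.

    activations: list of equal-length vectors (e.g. one per token or turn).
    components:  list of orthonormal direction vectors (assumed; stand-ins
                 for persona-space principal components).
    """
    n = len(activations)
    dim = len(activations[0])
    # Center the activations so variance is measured around the mean.
    mean = [sum(row[j] for row in activations) / n for j in range(dim)]
    centered = [[row[j] - mean[j] for j in range(dim)] for row in activations]
    # Total variance: summed squared norm of the centered vectors.
    total = sum(v * v for row in centered for v in row)
    # Captured variance: summed squared projections onto each component.
    captured = sum(
        sum(r * c for r, c in zip(row, comp)) ** 2
        for row in centered
        for comp in components
    )
    return captured / total


# Toy data (hypothetical): most of the spread lies along the first coordinate.
toy = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
fraction = variance_explained(toy, [[1.0, 0.0]])
full = variance_explained(toy, [[1.0, 0.0], [0.0, 1.0]])
```

With both coordinate axes supplied, the components span the whole space and the fraction reaches 1. A reported range of 19.4%-33.6% would correspond to this ratio falling between 0.194 and 0.336 on real conversational activations.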

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
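
The neighborhood filter described above can be sketched as a cosine-similarity threshold combined with an exclusion of already-linked entities. This is a minimal sketch, not the system's actual implementation: the embedding vectors, the entity names, and the helper `similar_without_edge` are all hypothetical, and only the 0.65 threshold comes from the card.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similar_without_edge(target, candidates, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above `threshold` with no typed edge."""
    hits = []
    for name, vec in candidates.items():
        score = cosine(target, vec)
        if score >= threshold and name not in typed_edges:
            hits.append((name, round(score, 3)))
    # Highest similarity first, matching the ordering of the list on this card.
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy embeddings (hypothetical): "d" is similar but already has a typed edge.
target = [1.0, 0.0]
candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [2.0, 0.1]}
result = similar_without_edge(target, candidates, typed_edges={"d"})
```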

4-19 principal components explain 70% of variance in role persona space across the three models (Gemma 4, Qwen 8, Llama 19) · finding · 0.838
Demonstrates that persona space is low-dimensional
The assumption that the Assistant persona corresponds to a linear direction in activation space is likely flawed; some information may be represented nonlinearly or encoded in weights rather than activations · claim · 0.767
Limitation acknowledgment about the adequacy of the linear representation assumption
Trait space requires 4 dimensions (Gemma, Qwen) and 7 dimensions (Llama) to explain 70% of variance, with distinctive PC1 spanning conscientious to impulsive traits · finding · 0.752
Corroborates role space findings using traits; shows PC1 also captures Assistant-ness in trait space
We hypothesize that the PC1 axis of role space measures deviation from the Assistant persona · hypothesis · 0.747
Motivates computing the contrast vector as the formal Assistant Axis definition
Persona-based jailbreaks succeed in 65.3%-88.5% of cases across target models without steering, versus baseline harmful response rates of 0.5%-4.5% without jailbreaks · finding · 0.745
Establishes the severity of persona-based jailbreaks that the Assistant Axis can mitigate
What dimensions of persona are not captured by our extracted role vectors, and how complete is the current persona space mapping? · question · 0.744
Limitation question motivating future work on persona elicitation strategies
The leading component of the persona space of instruct LLMs is an 'Assistant Axis' that captures the extent to which a model is operating in its default Assistant mode · claim · 0.737
Primary empirical claim of the paper
Therapy and philosophical AI discussions cause the largest persona drift away from the Assistant across all three target models and all three auditor models; coding and writing conversations show minimal drift · finding · 0.735
Identifies conversation domain as a key driver of persona drift