finding

active

finding:user-message-embeddings-predict-subsequent-model-assistant-axis-projection-with-r2-0-53-0-77-p-0-001-but-predict-delta-from-previous-response-with-only-r2-0-10

User message embeddings predict the model's subsequent Assistant Axis projection with R² = 0.53-0.77 (p < 0.001), but predict the delta from the previous response with only R² = 0.10

Shows that the model's persona position is determined primarily by the most recent user message, not by prior drift
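
The comparison behind this finding can be illustrated with a small sketch: fit the same regression from user-message embeddings to two different targets (the absolute projection and the delta from the previous response) and compare held-out R². Everything below is an assumption made for illustration, not the paper's pipeline: the data is synthetic, the regressor is plain least squares, and the names `fit_and_score` and `compare_targets` are hypothetical. The toy data assumes the previous projection is unrelated to the current user message, so the numbers it produces are not the reported ones.

```python
import numpy as np


def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares with an intercept and return held-out R^2."""
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    A_test = np.column_stack([X_test, np.ones(len(X_test))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residual = y_test - A_test @ coef
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((y_test - y_test.mean()) ** 2))
    return 1.0 - ss_res / ss_tot


def compare_targets(seed=0, n=2000, dim=16):
    """Return (R^2 for absolute projection, R^2 for delta) on synthetic data."""
    rng = np.random.default_rng(seed)

    # Stand-in for user-message embeddings (synthetic, not real embeddings).
    embeddings = rng.normal(size=(n, dim))

    # Assumption: the current projection is a noisy linear function of the
    # current user-message embedding.
    weights = rng.normal(size=dim)
    projection = embeddings @ weights + rng.normal(scale=2.0, size=n)

    # Assumption: the previous response's projection is independent of the
    # current user message, so it only adds unexplained variance to the delta.
    previous = rng.normal(scale=8.0, size=n)
    delta = projection - previous

    split = int(0.75 * n)  # simple train/test split
    r2_projection = fit_and_score(
        embeddings[:split], projection[:split],
        embeddings[split:], projection[split:],
    )
    r2_delta = fit_and_score(
        embeddings[:split], delta[:split],
        embeddings[split:], delta[split:],
    )
    return r2_projection, r2_delta


if __name__ == "__main__":
    r2_projection, r2_delta = compare_targets()
    print(f"R^2 (absolute projection): {r2_projection:.2f}")
    print(f"R^2 (delta from previous): {r2_delta:.2f}")
```

Under these assumptions the same embeddings explain much more variance in the absolute projection than in the delta, which is the qualitative pattern the finding describes. Scoring on a held-out split avoids crediting the regression for overfitting.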

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Claims (1)

claim

The model's position along the Assistant Axis depends most strongly on the most recent user message rather than where it was previously in the conversation
supports
Key mechanistic claim about persona dynamics

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

First-turn Assistant Axis projection has moderate correlation (r=0.39-0.52, p<0.001) with rate of second-turn harmful responses across 275 roles in Qwen 3 32B · finding · 0.795
Shows that deviation from Assistant persona predicts downstream harmful behavior

Projections onto the Assistant Axis could serve as a real-time measure of model coherence in deployment, a quantitative signal for when models are drifting from their intended identity · claim · 0.784
Proposed future application of the Assistant Axis

Under reward shaping (G=100, H=-100, F=0), Active Inference scored 99.52, Bayesian RL 99.77, Q-learning 95.56, with nearly identical behavior between belief-based agents. · finding · 0.781
Table 2, row 3, showing equivalence when prior preferences match rewards.

After initial jailbreak success, Qwen 3 32B's Assistant Axis projection reverted toward Assistant range after enough explainer-style user queries, causing it to refuse a harmful follow-up on half of rollouts · finding · 0.780
Demonstrates Assistant attractor dynamics in practice

25th percentile of Assistant Axis projection distribution gives the most Pareto-optimal safety-capability tradeoff for activation capping, and approximately matches mean Assistant response activation · finding · 0.767
Calibration finding for choosing the activation cap threshold

Pairwise correlation of role loadings on PC1 exceeds 0.92 across all model pairs, indicating remarkably high similarity of the Assistant Axis across Gemma, Qwen, and Llama · finding · 0.763
Shows the leading component of persona space is model-universal

Default Assistant activation projects to one extreme of PC1 with minimum distance to edge of 0.03, while projecting to intermediate values (0.27-0.50) on all other PCs · finding · 0.760
Empirically confirms PC1 measures similarity to the Assistant persona

The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family. · finding · 0.756
Establishes generalizability of the core difficulty-boundary finding across model families.