finding
active
finding:user-message-embeddings-predict-subsequent-model-assistant-axis-projection-with-r2-0-53-0-77-p-0-001-but-predict-delta-from-previous-response-with-only-r2-0-10
User message embeddings predict the model's subsequent Assistant Axis projection with R² = 0.53–0.77 (p < 0.001), but predict the delta from the previous response with only R² = 0.10
Shows that the model's persona position is determined primarily by the most recent user message, not by prior drift.
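The difference between the two regression targets in this finding (the projection itself versus its change from the previous response) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data is synthetic, the previous projection is drawn independently of the current message, and closed-form ridge regression stands in for whatever regression the source paper used. The names `fit_ridge`, `held_out_r2`, `projection`, and `prev_projection` are hypothetical, and the numbers this produces are not the paper's results.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the intercept is handled by centering."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def held_out_r2(X, y, n_train, alpha=1.0):
    """Fit on the first n_train rows, report R^2 on the remaining rows."""
    w, b = fit_ridge(X[:n_train], y[:n_train], alpha)
    return r_squared(y[n_train:], X[n_train:] @ w + b)


# Synthetic stand-ins (assumption): 16-dim "user message embeddings".
rng = np.random.default_rng(0)
n, d = 2000, 16
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)

# Target 1: the projection after the user message (the "level").
projection = X @ w_true + rng.normal(scale=2.0, size=n)

# Assumption: the previous response's projection is unrelated to the
# current user message, so it only adds unexplained variance to the delta.
prev_projection = rng.normal(scale=4.5, size=n)

# Target 2: the change from the previous response (the "delta").
delta = projection - prev_projection

r2_level = held_out_r2(X, projection, n_train=1500)
r2_delta = held_out_r2(X, delta, n_train=1500)
print(f"R^2 predicting projection: {r2_level:.2f}")
print(f"R^2 predicting delta:      {r2_delta:.2f}")
```

In this toy setup the same embeddings explain less of the delta than of the level, because the delta also carries the variance of the previous position, which the current message says nothing about. That mirrors the shape of the comparison in the finding, not its magnitudes.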
Source paper
extracted_from(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Key mechanistic claim about persona dynamics
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows that deviation from Assistant persona predicts downstream harmful behavior
- Proposed future application of the Assistant Axis
- Table 2, row 3, showing equivalence when prior preferences match rewards.
- Demonstrates Assistant attractor dynamics in practice
- Calibration finding for choosing the activation cap threshold
- Shows the leading component of persona space is model-universal
- Empirically confirms PC1 measures similarity to the Assistant persona
- Establishes generalizability of the core difficulty-boundary finding across model families.