claim

active

claim:the-model-s-position-along-the-assistant-axis-depends-most-strongly-on-the-most-recent-user-message-rather-than-where-it-was-previously-in-the-conversation

The model's position along the Assistant Axis depends most strongly on the most recent user message rather than where it was previously in the conversation

Key mechanistic claim about persona dynamics

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Findings (1)

finding

User message embeddings predict the model's subsequent Assistant Axis projection with R² = 0.53–0.77 (p < 0.001), but predict the delta from the previous response with only R² = 0.10
supports
Shows that the model's persona position is determined primarily by the most recent user message, not by prior drift
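
The comparison behind this finding can be illustrated with a small sketch: fit one regression from user-message embeddings to the next response's axis projection, fit another from the same embeddings to the change in projection since the previous response, and compare held-out R². Everything below is an assumption made for illustration: the data are synthetic, the closed-form ridge regression is a stand-in rather than the paper's stated method, and the names `ridge_fit`, `heldout_r2`, `user_emb`, `projection`, and `prev_projection` are hypothetical. The numbers it prints are not meant to reproduce the reported values.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression; the intercept is handled by centering."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def heldout_r2(X, y, n_train, alpha=1.0):
    """Fit on the first n_train rows, report R^2 on the remaining rows."""
    w, b = ridge_fit(X[:n_train], y[:n_train], alpha)
    return r_squared(y[n_train:], X[n_train:] @ w + b)


# Synthetic stand-in data (assumption): each row is one conversation turn.
rng = np.random.default_rng(0)
n_turns, dim = 2000, 16
user_emb = rng.normal(size=(n_turns, dim))       # user-message embeddings
signal = user_emb @ rng.normal(size=dim)
signal /= signal.std()
projection = signal + 0.7 * rng.normal(size=n_turns)  # next response's axis projection
prev_projection = rng.normal(size=n_turns)       # previous response's projection
delta = projection - prev_projection             # change since the previous response

r2_projection = heldout_r2(user_emb, projection, n_train=1500)
r2_delta = heldout_r2(user_emb, delta, n_train=1500)
print(f"R^2 (projection): {r2_projection:.2f}")
print(f"R^2 (delta):      {r2_delta:.2f}")
```

In this toy setup the previous projection is drawn independently of the current user message, so the delta carries extra variance that the embeddings cannot explain, and its R² comes out lower than the R² for the projection itself. That is the qualitative pattern the finding describes.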

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training · claim · 0.813
Key mechanistic claim about the developmental origin of the Assistant persona
Projections onto the Assistant Axis could serve as a real-time measure of model coherence in deployment, a quantitative signal for when models are drifting from their intended identity · claim · 0.798
Proposed future application of the Assistant Axis
The Assistant Axis is also present in pre-trained base models, where it primarily promotes helpful human archetypes (consultants, coaches) and inhibits spiritual ones · claim · 0.787
Extends the Assistant Axis finding to pre-training, suggesting pre-training rather than post-training creates the axis
The leading component of the persona space of instruct LLMs is an 'Assistant Axis' that captures the extent to which a model is operating in its default Assistant mode · claim · 0.785
Primary empirical claim of the paper
What exactly is the Assistant? What traits does the model associate with this character and how are they represented? · question · 0.778
First of two central questions motivating the paper
We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions. · hypothesis · 0.775
Motivating hypothesis for Section 5's investigation of prompt template effects.
Steering base models toward the Assistant Axis increases agreeableness traits (friendly, kind, helpful) and decreases extraversion in Gemma and openness in Llama · finding · 0.775
Characterizes the trait content of the Assistant Axis in pre-trained models
How reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas? · question · 0.774
Second of two central questions motivating the paper