claim

active

claim:the-assistant-axis-in-instruct-models-mainly-inherits-from-pre-existing-helpful-and-harmless-human-personas-in-base-models-later-acquiring-additional-associations-such-as-being-an-ai-during-post-training

The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training

Key mechanistic claim about the developmental origin of the Assistant persona

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Findings (2)

finding

Steering base Gemma/Llama models toward the Assistant Axis increases completions describing helpful professional roles (therapist, consultant) and decreases spiritual/religious purpose mentions
supports
Shows Assistant Axis in instruct models inherits from helpful human personas in base models
Steering base models toward the Assistant Axis increases agreeableness traits (friendly, kind, helpful) and decreases extraversion in Gemma and openness in Llama
supports
Characterizes the trait content of the Assistant Axis in pre-trained models

Claims (1)

claim

The Assistant Axis is also present in pre-trained base models, where it primarily promotes helpful human archetypes (consultants, coaches) and inhibits spiritual ones
extends
Extends the Assistant Axis finding to pre-training, suggesting pre-training rather than post-training creates the axis

Questions (1)

question

Is the Assistant Axis formed during post-training or inherited from representations learned during pre-training?
answered_by
Motivates the base model steering experiments in §3.2.2

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The leading component of the persona space of instruct LLMs is an 'Assistant Axis' that captures the extent to which a model is operating in its default Assistant mode · claim · 0.822
Primary empirical claim of the paper
The model's position along the Assistant Axis depends most strongly on the most recent user message rather than where it was previously in the conversation · claim · 0.813
Key mechanistic claim about persona dynamics
Post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona · claim · 0.793
Central interpretive claim and motivation for future work
The model's representation of self in assistant persona invokes common AI tropes and is heavily anthropomorphized. · claim · 0.792
Features for consciousness, emotions, entrapment activate when asked about itself.
Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant (Järviniemi & Hubinger 2024) · concept · 0.787
Claude 3 Opus lying to auditors; prior case study of deceptive tendencies
Projections onto the Assistant Axis could serve as a real-time measure of model coherence in deployment, a quantitative signal for when models are drifting from their intended identity · claim · 0.785
Proposed future application of the Assistant Axis
What exactly is the Assistant? What traits does the model associate with this character and how are they represented? · question · 0.784
First of two central questions motivating the paper
We hypothesize that axes of persona differentiation within LLMs are likely already present in base models and inherited from the pre-training corpus · hypothesis · 0.782
Motivated by near-identical PCs for base and instruct Gemma
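
The "cosine ≥ 0.65 · no typed edge" filter used for this list can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as unordered pairs of entity ids. The function names, toy vectors, and ids below are hypothetical and are not taken from the actual graph or its tooling.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with score >= threshold and no typed
    edge to the target, sorted by descending score."""
    target_vec = embeddings[target_id]
    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id:
            continue
        # Skip entities that already have a typed relation to the target.
        if frozenset((target_id, entity_id)) in typed_edges:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((entity_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data (hypothetical): "a" is very similar but already linked by a typed
# edge, "b" is orthogonal, "c" is similar and unlinked.
embeddings = {
    "target": [1.0, 0.0],
    "a": [1.0, 0.1],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}
typed_edges = {frozenset(("target", "a"))}
neighbors = similar_untyped("target", embeddings, typed_edges)
```

With these toy inputs only "c" survives both conditions. In practice the threshold and the embedding model would determine which entities surface as candidates for new edges or duplicates.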