question

active

question:is-the-assistant-axis-formed-during-post-training-or-inherited-from-representations-learned-during-pre-training

Is the Assistant Axis formed during post-training or inherited from representations learned during pre-training?

Motivates the base model steering experiments in §3.2.2

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Papers (1)

paper

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
associated_with

Claims (1)

claim

The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training
answered_by
Key mechanistic claim about the developmental origin of the Assistant persona

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The Assistant Axis is also present in pre-trained base models, where it primarily promotes helpful human archetypes (consultants, coaches) and inhibits spiritual ones · claim · 0.828
Extends the Assistant Axis finding to pre-training, suggesting pre-training rather than post-training creates the axis
Assistant Axis · framework · 0.762
Contrast vector between mean default Assistant activation and mean of all fully role-playing role vectors; main contribution of the paper
Post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona · claim · 0.758
Central interpretive claim and motivation for future work
The model's position along the Assistant Axis depends most strongly on the most recent user message rather than where it was previously in the conversation · claim · 0.757
Key mechanistic claim about persona dynamics
We hypothesize that axes of persona differentiation within LLMs are likely already present in base models and inherited from the pre-training corpus · hypothesis · 0.755
Motivated by near-identical PCs for base and instruct Gemma
Projections onto the Assistant Axis could serve as a real-time measure of model coherence in deployment: a quantitative signal for when models are drifting from their intended identity · claim · 0.755
Proposed future application of the Assistant Axis
Post-training alignment · concept · 0.751
Broader research area: methods to align model behavior after initial training, where undesired behaviors can emerge.
Post-training is key to eliciting introspective awareness · finding · 0.745
Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training is essential for introspection.
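
The Assistant Axis entry above describes a contrast vector between the mean default Assistant activation and the mean of the role vectors, and a related claim proposes projections onto that axis as a drift signal. A minimal sketch of both operations follows, on toy data. The function names, the sign convention (Assistant minus roles), and the unit-length normalization are assumptions made for illustration, not details confirmed by the paper.

```python
from math import sqrt
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty collection of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def contrast_axis(default_acts: Sequence[Sequence[float]],
                  role_vectors: Sequence[Sequence[float]]) -> Vector:
    """Unit vector pointing from the mean role vector toward the mean
    default Assistant activation (sign and normalization are assumptions)."""
    mu_default = mean_vector(default_acts)
    mu_roles = mean_vector(role_vectors)
    diff = [a - b for a, b in zip(mu_default, mu_roles)]
    norm = sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("the two means coincide, so the axis is undefined")
    return [x / norm for x in diff]


def project(activation: Sequence[float], axis: Sequence[float]) -> float:
    """Scalar projection of one activation onto the axis (dot product)."""
    return sum(a * b for a, b in zip(activation, axis))


# Toy 2-D "activations"; real ones would be hidden states from a model layer.
default_acts = [[1.0, 0.0], [3.0, 0.0]]    # mean is [2.0, 0.0]
role_vectors = [[0.0, 1.0], [0.0, -1.0]]   # mean is [0.0, 0.0]

axis = contrast_axis(default_acts, role_vectors)
score = project([0.5, 2.0], axis)
```

Under these assumptions, a larger `score` means the activation sits closer to the default Assistant end of the axis, so a falling score over a conversation would be one candidate drift signal.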