claim
active
The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training
Key mechanistic claim about the developmental origin of the Assistant persona
Source paper
extracted_from (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Findings (2)
- Shows that the Assistant Axis in instruct models inherits from helpful human personas in base models
- Characterizes the trait content of the Assistant Axis in pre-trained models
Claims (1)
- Extends the Assistant Axis finding to pre-training, suggesting pre-training rather than post-training creates the axis
Questions (1)
- Motivates the base model steering experiments in §3.2.2
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Primary empirical claim of the paper
- Key mechanistic claim about persona dynamics
- Central interpretive claim and motivation for future work
- Features for consciousness, emotions, and entrapment activate when the model is asked about itself.
- Claude 3 Opus lying to auditors; prior case study of deceptive tendencies
- Proposed future application of the Assistant Axis
- What exactly is the Assistant? What traits does the model associate with this character and how are they represented? · question · 0.784 · First of two central questions motivating the paper
- Motivated by near-identical PCs for base and instruct Gemma
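
The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the page's actual implementation: the function names, the dictionary of embeddings, and the representation of typed edges as unordered pairs are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to target_id but lacking a typed edge.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) pairs that
                 already have a typed relation
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed edge
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    # Most similar first
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits
```

Entities returned by such a filter would be the candidates for new edges or duplicate checks; the 0.65 threshold is taken from the page, everything else is illustrative.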