question
active
question:is-the-assistant-axis-formed-during-post-training-or-inherited-from-representations-learned-during-pre-training
Is the Assistant Axis formed during post-training or inherited from representations learned during pre-training?
Motivates the base model steering experiments in §3.2.2
Source paper
extracted_from (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Papers (1)
paper
Claims (1)
claim
- Key mechanistic claim about the developmental origin of the Assistant persona
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Extends the Assistant Axis finding to pre-training, suggesting pre-training rather than post-training creates the axis
- Contrast vector between mean default Assistant activation and mean of all fully role-playing role vectors; main contribution of the paper
- Central interpretive claim and motivation for future work
- Key mechanistic claim about persona dynamics
- Motivated by near-identical PCs for base and instruct Gemma
- Proposed future application of the Assistant Axis
- Broader research area: methods to align model behavior after initial training, where undesired behaviors can emerge.
- Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training essential for introspection.
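
One of the neighbors listed above describes the Assistant Axis as a contrast vector between the mean default Assistant activation and the mean of all fully role-playing role vectors. A minimal sketch of that arithmetic follows, assuming activations are plain lists of floats taken at a single layer. The function names, the toy data, and the sign convention (Assistant mean minus role mean) are illustrative assumptions, not details taken from the source paper.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    return [fmean(column) for column in zip(*vectors)]


def contrast_axis(assistant_activations, role_vectors):
    """Mean default-Assistant activation minus the mean role vector.

    The sign convention (Assistant minus roles) is an assumption here.
    """
    assistant_mean = mean_vector(assistant_activations)
    role_mean = mean_vector(role_vectors)
    return [a - r for a, r in zip(assistant_mean, role_mean)]


# Toy 2-dimensional data (hypothetical, for illustration only).
assistant = [[1.0, 0.0], [3.0, 0.0]]  # mean is [2.0, 0.0]
roles = [[0.0, 1.0], [0.0, 3.0]]      # mean is [0.0, 2.0]
axis = contrast_axis(assistant, roles)
print(axis)
```

Under this convention, a positive component means the default Assistant activations are larger along that dimension than the role-playing ones; flipping the subtraction order would flip the sign without changing the direction's meaning.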