question

active

question:how-does-different-post-training-data-shift-a-model-s-position-along-persona-dimensions

How does different post-training data shift a model's position along persona dimensions?

Future work direction: using persona space to study effects of training data on model character

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Papers (1)

paper

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
associated_with

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

post-training steers models toward a particular region of persona space but only loosely tethers them to it · quote · 0.864
Load-bearing summary of the paper's core finding about persona stability
Post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona · claim · 0.846
Central interpretive claim and motivation for future work
What dimensions of persona are not captured by our extracted role vectors, and how complete is the current persona space mapping? · question · 0.806
Limitation question motivating future work on persona elicitation strategies
We hypothesize that axes of persona differentiation within LLMs are likely already present in base models and inherited from the pre-training corpus · hypothesis · 0.800
Motivated by near-identical PCs for base and instruct Gemma
What if the concept being manipulated does not lie on a straight line in the model's representations? · question · 0.770
The motivating question that opens the paper and leads to the development of manifold steering.
The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training · claim · 0.770
Key mechanistic claim about the developmental origin of the Assistant persona
Can off-the-rails model behavior be attributed to their persona drifting from the Assistant? · question · 0.769
Motivates the multi-turn conversation drift experiments in §4
Post-training is key to eliciting strong introspective awareness; base pretrained models do not show above-chance detection · claim · 0.766
Finding that base models have high false positives and no net positive performance.