quote

active

quote:post-training-steers-models-toward-a-particular-region-of-persona-space-but-only-loosely-tethers-them-to-it

post-training steers models toward a particular region of persona space but only loosely tethers them to it

Load-bearing summary of the paper's core finding about persona stability

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona
supports
Central interpretive claim and motivation for future work

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

How does different post-training data shift a model's position along persona dimensions? · question · 0.864
Future work direction: using persona space to study effects of training data on model character
Post-training is key to eliciting strong introspective awareness; base pretrained models do not show above-chance detection · claim · 0.795
Finding that base models have high false positives and no net positive performance.
Post-training is key to eliciting introspective awareness · finding · 0.778
Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training is essential for introspection.
The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training · claim · 0.777
Key mechanistic claim about the developmental origin of the Assistant persona
Training a model organism with known ground-truth deployment/evaluation behaviors allows validation that steering elicits deployment behavior rather than merely suppressing verbalizations · claim · 0.776
Methodological claim distinguishing this paper from prior work on verbalization suppression.
Can off-the-rails model behavior be attributed to their persona drifting from the Assistant? · question · 0.770
Motivates the multi-turn conversation drift experiments in §4
Post-training influences introspective capability expression · claim · 0.769
Different post-training strategies substantially influence introspection task performance; 'helpful-only' variants show higher false positives but some achieve strong net performance.
Post-training strategies can strongly influence performance on introspective tasks · claim · 0.767
Assertion about the role of post-training in eliciting introspection.
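
The selection rule stated for this list (cosine ≥ 0.65, no typed edge, highest score first) can be illustrated with a minimal sketch. This is not the tool's actual implementation, which the page does not describe. The function names `cosine` and `related_by_similarity`, the `embeddings` dictionary, and the `typed_edges` set of unordered ID pairs are all hypothetical, introduced only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    Assumed inputs (hypothetical, not taken from the source):
      embeddings  -- dict mapping entity ID to an embedding vector
      typed_edges -- set of frozenset({id_a, id_b}) pairs that already
                     have a typed relation such as "supports"
    """
    target_vec = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed edge
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((other_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data: "c" is identical to the target but already has a typed edge,
# and "b" is orthogonal, so only "a" qualifies.
toy_embeddings = {"q": [1.0, 0.0], "a": [1.0, 0.1], "b": [0.0, 1.0], "c": [1.0, 0.0]}
toy_edges = {frozenset(("q", "c"))}
candidates = related_by_similarity("q", toy_embeddings, toy_edges)
```

In the toy data, only entity `a` survives both filters. Entities returned this way are the candidates for new edges or duplicate checks described above.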