quote
active
quote:post-training-steers-models-toward-a-particular-region-of-persona-space-but-only-loosely-tethers-them-to-it
post-training steers models toward a particular region of persona space but only loosely tethers them to it
Load-bearing summary of the paper's core finding about persona stability
Source paper
extracted_from · (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central interpretive claim and motivation for future work
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- How does different post-training data shift a model's position along persona dimensions? · question · 0.864 · Future work direction: using persona space to study effects of training data on model character
- Finding that base models have high false positives and no net positive performance.
- Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training essential for introspection.
- Key mechanistic claim about the developmental origin of the Assistant persona
- Methodological claim distinguishing this paper from prior work on verbalization suppression.
- Can off-the-rails model behavior be attributed to their persona drifting from the Assistant? · question · 0.770 · Motivates the multi-turn conversation drift experiments in §4
- Different post-training strategies substantially influence introspection task performance; 'helpful-only' variants show higher false positives but some achieve strong net performance.
- Assertion about the role of post-training in eliciting introspection.