claim

active

claim:post-training-steers-models-toward-a-particular-region-of-persona-space-but-only-loosely-tethers-them-to-it-motivating-work-on-training-and-steering-strategies-that-more-deeply-anchor-models-to-a-coherent-persona

Post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona

Central interpretive claim and motivation for future work

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Findings (3)

finding

Activation capping reduces harmful response rate by nearly 60% without impacting performance on IFEval, MMLU Pro, GSM8k, and EQ-Bench
supports
Main quantitative result demonstrating effectiveness of activation capping
Steering away from the Assistant Axis slightly increases jailbreak success rate, though extreme steering degrades output quality
supports
Confirms bidirectional causal relationship between Assistant Axis position and harmful behavior susceptibility
Therapy and philosophical AI discussions cause the largest persona drift away from the Assistant across all three target models and all three auditor models; coding and writing conversations show minimal drift
supports
Identifies conversation domain as a key driver of persona drift

Claims (1)

claim

Two components are important to shaping model character: persona construction and persona stabilization
extends
Overarching conceptual framework the paper introduces for model safety

Questions (1)

question

How reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas?
gates
Second of two central questions motivating the paper

Quotes (1)

quote

post-training steers models toward a particular region of persona space but only loosely tethers them to it
supports
Load-bearing summary of the paper's core finding about persona stability

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

How does different post-training data shift a model's position along persona dimensions? · question · 0.846
Future work direction: using persona space to study effects of training data on model character
Post-training is key to eliciting strong introspective awareness; base pretrained models do not show above-chance detection · claim · 0.821
Finding that base models have high false positives and no net positive performance.
Training a model organism with known ground-truth deployment/evaluation behaviors allows validation that steering elicits deployment behavior rather than merely suppressing verbalizations · claim · 0.815
Methodological claim distinguishing this paper from prior work on verbalization suppression.
Post-training strategies can strongly influence performance on introspective tasks · claim · 0.813
Assertion about the role of post-training in eliciting introspection.
Post-training is key to eliciting introspective awareness · finding · 0.803
Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training is essential for introspection.
Post-training influences introspective capability expression · claim · 0.798
Different post-training strategies substantially influence introspection task performance; 'helpful-only' variants show higher false positives but some achieve strong net performance.
The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training · claim · 0.793
Key mechanistic claim about the developmental origin of the Assistant persona
Steering vector extracted from final post-expert-iteration model also successfully elicits deployment behavior · finding · 0.783
Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
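
The "cosine ≥ 0.65 · no typed edge" rule that selects the entries above can be sketched as follows. This is a minimal illustration, not the page's actual implementation: it assumes each entity has an embedding vector and that typed edges are stored as pairs of entity ids. The names `cosine`, `related_by_similarity`, `embeddings`, and `typed_edges` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score.

    embeddings:  dict of entity id -> vector (assumed representation)
    typed_edges: iterable of (id_a, id_b) pairs, direction ignored (assumed)
    """
    target = embeddings[target_id]
    # Collect every entity already linked to the target by a typed edge.
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)
    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, round(score, 3)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Under these assumptions, an entity that is already connected by a typed edge such as supports or extends is excluded even when its score is high, which is why the list surfaces only candidates for new edges or unrecognized duplicates.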