hypothesis

active

hypothesis:we-hypothesize-that-axes-of-persona-differentiation-within-llms-are-likely-already-present-in-base-models-and-inherited-from-the-pre-training-corpus

We hypothesize that axes of persona differentiation within LLMs are likely already present in base models and inherited from the pre-training corpus

Motivated by the near-identical role principal components (PCs) of base and instruct Gemma 2 27B

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Findings (1)

finding

Base and instruct Gemma 2 27B role PCs have cosine similarities of 0.93, 0.87, 0.83 for the top 3 PCs respectively; role vector cosine similarities >0.99 for every role pair
supports
Shows persona space axes are inherited from pre-training, not solely created by post-training
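The comparison behind this finding can be illustrated with a minimal sketch. It assumes each model's role vectors are stacked into a matrix of shape (n_roles, d) with rows aligned by role, that PCs are taken from the mean-centered matrix, and that "every role pair" means each role's base vector compared with its instruct counterpart. The function names and the synthetic data are hypothetical and are not taken from the paper; the paper's actual extraction and PCA pipeline may differ.

```python
import numpy as np


def top_pcs(role_vectors, k=3):
    """Return the top-k principal directions (as rows) of an (n_roles, d) matrix."""
    centered = role_vectors - role_vectors.mean(axis=0, keepdims=True)
    # Rows of vt are unit-norm principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]


def pc_cosines(base_vectors, instruct_vectors, k=3):
    """Cosine similarity between the i-th base PC and the i-th instruct PC."""
    base_pcs = top_pcs(base_vectors, k)
    inst_pcs = top_pcs(instruct_vectors, k)
    # The sign of a PC is arbitrary, so directions are compared up to sign.
    return np.abs(np.sum(base_pcs * inst_pcs, axis=1))


def role_cosines(base_vectors, instruct_vectors):
    """Cosine similarity between each role's base vector and its instruct vector."""
    dots = np.sum(base_vectors * instruct_vectors, axis=1)
    norms = np.linalg.norm(base_vectors, axis=1) * np.linalg.norm(instruct_vectors, axis=1)
    return dots / norms


# Synthetic stand-in data (NOT real activations): a shared low-rank structure,
# with the "instruct" vectors modeled as a small perturbation of the "base" ones.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(64, 3)))
latent = rng.normal(size=(50, 3)) * np.array([10.0, 5.0, 2.5])
base = latent @ basis.T + 0.1 * rng.normal(size=(50, 64))
instruct = base + 0.05 * rng.normal(size=base.shape)

print("PC cosines:", pc_cosines(base, instruct))
print("min role cosine:", role_cosines(base, instruct).min())
```

Matching PCs by rank, as done here, is only meaningful when the explained variances are well separated; if two components have similar variance, their order can swap between models and a full cross-similarity matrix would be the safer comparison.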

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

How does different post-training data shift a model's position along persona dimensions? · question · 0.800
Future work direction: using persona space to study effects of training data on model character
The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training · claim · 0.782
Key mechanistic claim about the developmental origin of the Assistant persona
What dimensions of persona are not captured by our extracted role vectors, and how complete is the current persona space mapping? · question · 0.782
Limitation question motivating future work on persona elicitation strategies
The leading component of the persona space of instruct LLMs is an 'Assistant Axis' that captures the extent to which a model is operating in its default Assistant mode · claim · 0.781
Primary empirical claim of the paper
It is plausible that ongoing developments in LLMs may lead to models or agentic systems built on LLMs capable of generating representations observed with 'consciousness' phenomena. · claim · 0.774
Forward-looking claim suggesting the methodological framework is relevant for future AI systems beyond current LLMs.
Connecting the Dots: LLMs Can Infer and Verbalize Latent Structure from Disparate Training Data (Treutlein et al. 2024) · concept · 0.773
Out-of-context reasoning work directly related to synthetic document fine-tuning experiments
With an LLM-based dialogue agent, it is role play all the way down: there is no such thing as the true authentic voice of the base model · claim · 0.767
The paper's strong claim that there is no underlying authentic agent behind the simulator, only layers of role play
Transformers develop self-models through in-context learning, not just training data; even old base models without LLM-related text can bootstrap self-referential reasoning at runtime. · claim · 0.765
Antra's foundational claim about how introspection arises computationally rather than from memorised text.