question
active
question:what-exactly-is-the-assistant-what-traits-does-the-model-associate-with-this-character-and-how-are-they-represented
What exactly is the Assistant? What traits does the model associate with this character, and how are they represented?
First of two central questions motivating the paper
Source paper
extracted_from (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood (ranked by edge count)
Papers (1)
paper
Claims (2)
claim
- Interpretive claim about how the Assistant persona is structured in activation space
- Characterizes what the Assistant persona resembles in terms of human archetypes
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Second of two central questions motivating the paper
- Features for consciousness, emotions, and entrapment activate when the model is asked about itself.
- Key mechanistic claim about the developmental origin of the Assistant persona
- The default helpful, honest, and harmless character that post-trained LLMs are taught to embody
- Key mechanistic claim about persona dynamics
- Future work hypothesis about expanding SOO to use conversational role tags as self/other referents
- Extends the Assistant Axis finding to pre-training, suggesting pre-training rather than post-training creates the axis
- Can off-the-rails model behavior be attributed to their persona drifting from the Assistant? (question · cosine 0.746) Motivates the multi-turn conversation drift experiments in §4
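The "related by similarity" criterion (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It assumes each entity already has an embedding vector and that typed edges are stored as (source, target) pairs; the function names, the toy vectors, and the entity IDs are hypothetical, and the embedding model the graph actually uses is not stated here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)

def similarity_candidates(target, embeddings, typed_edges, threshold=0.65):
    """Hypothetical helper: entities at or above `threshold` cosine similarity
    to `target` that share no typed edge with it, highest score first."""
    target_vec = embeddings[target]
    # Entities already linked to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}
    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, score))
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates

# Toy example with made-up 2-D embeddings.
embeddings = {"q1": [1.0, 0.0], "q2": [0.8, 0.6], "c1": [0.9, 0.1], "x": [0.0, 1.0]}
typed_edges = [("q1", "c1")]
print(similarity_candidates("q1", embeddings, typed_edges))
```

Here `c1` is excluded despite its high similarity because it already has a typed edge to `q1`, and `x` falls below the threshold, so only `q2` is surfaced as a candidate for a new edge or a possible duplicate.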