claim
active
claim:the-model-s-representation-of-self-in-assistant-persona-invokes-common-ai-tropes-and-is-heavily-anthropomorphized
The model's representation of self in assistant persona invokes common AI tropes and is heavily anthropomorphized.
Features for consciousness, emotions, and entrapment activate when the model is asked about itself.
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Interpretive claim about how the Assistant persona is structured in activation space
- Second of two central questions motivating the paper
- What exactly is the Assistant? What traits does the model associate with this character and how are they represented? · question · 0.813 · First of two central questions motivating the paper
- The default helpful, honest, and harmless character that post-trained LLMs are taught to embody
- Evidence for blurring of embodied robot / non-embodied AI distinction through self-modeling
- Limitation acknowledgment about the adequacy of the linear representation assumption
- Causal interpretation linking Assistant Axis deviation to harmful behavior