concept
active
concept:ai-assistant-persona
AI Assistant Persona
The default helpful, honest, and harmless character that post-trained LLMs are taught to embody
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Post-Training · associated_with · The phase after pre-training where models are further tuned with techniques like DPO; the period where the studied behavior emerged.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Features for consciousness, emotions, and entrapment activate when the model is asked about itself.
- What exactly is the Assistant? What traits does the model associate with this character and how are they represented? · question · 0.784 · First of two central questions motivating the paper
- Unintended personas introduced as a side effect of using steering vectors to reduce eval awareness.
- Interpretive claim about how the Assistant persona is structured in activation space
- Future AI that may be rational, autonomous, and possibly conscious but lack affective consciousness.
- Partner organization with Goodfire on materials discovery research; partnership announced July 2025.
- Higher-level systems built on top of LLMs that produce and consume representations beyond next-token prediction; proposed as potential candidates for consciousness.