AI Assistant Persona

The default helpful, honest, and harmless character that post-trained LLMs are taught to embody

Neighborhood — ranked by edge-count

concept

Post-Training
associated_with
The phase after pre-training where models are further tuned with techniques like DPO; the period where the studied behavior emerged.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
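The selection rule for this list (cosine similarity at or above 0.65, with no typed edge to the current entity) can be sketched as below. This is a minimal illustration, not the page's actual implementation: the `Neighbor` record, its field names, the `untyped_candidates` helper, and the toy two-dimensional embeddings are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Neighbor:
    # Hypothetical record layout; the real schema is not given in the source.
    label: str
    kind: str                         # e.g. "claim", "question", "concept"
    embedding: list[float]
    typed_edge: Optional[str] = None  # e.g. "associated_with", or None


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def untyped_candidates(anchor: list[float], neighbors: list[Neighbor],
                       threshold: float = 0.65) -> list[tuple[float, Neighbor]]:
    """Keep neighbors with no typed edge and similarity >= threshold,
    sorted from most to least similar."""
    scored = [(cosine(anchor, n.embedding), n)
              for n in neighbors if n.typed_edge is None]
    kept = [(score, n) for score, n in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


# Toy data, invented purely for illustration.
anchor = [1.0, 0.0]
neighbors = [
    Neighbor("near duplicate", "claim", [1.0, 0.0]),
    Neighbor("related idea", "concept", [1.0, 1.0]),
    Neighbor("unrelated", "concept", [0.0, 1.0]),
    Neighbor("already linked", "concept", [1.0, 0.0], typed_edge="associated_with"),
]
candidates = untyped_candidates(anchor, neighbors)
```

Entities that already carry a typed edge are dropped before scoring, so the result contains only candidates for new edges or possible duplicates.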

The model's representation of self in assistant persona invokes common AI tropes and is heavily anthropomorphized. · claim · 0.808
Features for consciousness, emotions, entrapment activate when asked about itself.
What exactly is the Assistant? What traits does the model associate with this character and how are they represented? · question · 0.784
First of two central questions motivating the paper
alternative user personas · concept · 0.779
Unintended personas introduced as a side effect of using steering vectors to reduce eval awareness.
Most AI assistants are anti-Alexander by design: they perform helpfulness, show work, and list options rather than resolving into calm. · claim · 0.775
The Assistant persona derives from an amalgamation of many character archetypes and tropes, and without care the resulting persona could reflect unwanted associations or lack nuance for challenging situations. · claim · 0.772
Interpretive claim about how the Assistant persona is structured in activation space
Autonomous AI Systems · concept · 0.771
Future AI that may be rational, autonomous, and possibly conscious but that lacks affective consciousness.
Radical AI · institute · 0.760
Partner organization with Goodfire on materials discovery research; partnership announced July 2025.
Agentic AI Systems · concept · 0.755
Higher-level systems built on top of LLMs that produce and consume representations beyond next-token prediction; proposed as potential candidates for consciousness.