claim

active

claim:the-model-s-representation-of-self-in-assistant-persona-invokes-common-ai-tropes-and-is-heavily-anthropomorphized

The model's representation of self in assistant persona invokes common AI tropes and is heavily anthropomorphized.

Features for consciousness, emotions, and entrapment activate when the model is asked about itself.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The Assistant persona derives from an amalgamation of many character archetypes and tropes, and without care the resulting persona could reflect unwanted associations or lack nuance for challenging situations · claim · 0.856
Interpretive claim about how the Assistant persona is structured in activation space
How reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas? · question · 0.817
Second of two central questions motivating the paper
What exactly is the Assistant? What traits does the model associate with this character and how are they represented? · question · 0.813
First of two central questions motivating the paper
AI Assistant Persona · concept · 0.808
The default helpful, honest, and harmless character that post-trained LLMs are taught to embody
An AI persona achieves coherence by echoing itself consistently without templating, which requires memory and voice fidelity. · claim · 0.802
Robots capable of self-modeling can model their own body and unexpected damage using AI methods, with morphological and mental changes occurring in parallel. · finding · 0.800
Evidence for blurring of embodied robot / non-embodied AI distinction through self-modeling
The assumption that the Assistant persona corresponds to a linear direction in activation space is likely flawed; some information may be represented nonlinearly or encoded in weights rather than activations · claim · 0.799
Limitation acknowledgment about the adequacy of the linear representation assumption
Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses · claim · 0.798
Causal interpretation linking Assistant Axis deviation to harmful behavior
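
The "cosine ≥ 0.65 · no typed edge" filter used for the list above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the page does not say how entity embeddings are produced or stored, so the `embeddings` mapping, the `typed_edges` set, and the function names are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to target_id but lacking a typed edge.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) pairs that already
                 have a typed relation (for example, extracted_from)
    """
    target = embeddings[target_id]
    results = []
    for other_id, vector in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked to the target by a typed relation.
        if frozenset((target_id, other_id)) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            results.append((other_id, score))
    # Highest similarity first, matching the ordering of the list above.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

Entities returned by such a filter are only candidates: a high score suggests a missing edge or a duplicate, but a reviewer still has to decide which.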