alternative user personas

Unintended personas introduced as a side effect of using steering vectors to reduce eval awareness.

Neighborhood — ranked by edge-count

paper

concept

steering vectors
associated_with
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
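
As a rough illustration of what "adding perturbation vectors to activations" means, here is a minimal NumPy sketch. The function name `apply_steering`, the scale `alpha`, and the toy numbers are assumptions made for illustration and are not taken from the source paper; a real setup would add the vector inside a model's forward pass at a chosen layer.

```python
import numpy as np

def apply_steering(hidden, steering_vector, alpha=1.0):
    """Add a scaled steering vector to the activation at every token position.

    hidden: array of shape (num_tokens, hidden_size)
    steering_vector: array of shape (hidden_size,)
    alpha: hypothetical scale; a negative value steers away from the direction
    """
    hidden = np.asarray(hidden, dtype=float)
    steering_vector = np.asarray(steering_vector, dtype=float)
    if hidden.shape[-1] != steering_vector.shape[-1]:
        raise ValueError("steering vector must match the hidden dimension")
    # Broadcasting adds the same perturbation to each token's activation.
    return hidden + alpha * steering_vector

# Toy activations: 3 token positions, hidden size 4 (made-up numbers).
hidden = np.zeros((3, 4))
direction = np.array([1.0, 0.0, -1.0, 0.0])
steered = apply_steering(hidden, direction, alpha=-2.0)
```

Because the same vector is added at every position, a large `alpha` shifts the whole sequence of activations, which is one plausible way side effects such as unintended personas could arise.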

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

AI Assistant Persona · concept · 0.779
The default helpful, honest, and harmless character that post-trained LLMs are taught to embody.
User Participation · concept · 0.738
The principle that residents should directly determine the shape and character of their own housing.
Persona Construction · concept · 0.734
The process of building a coherent model persona from character archetypes and traits during training.
Mystical/Theatrical Persona · concept · 0.711
Speaking style induced by extreme steering away from the Assistant; characterized by mystical, poetic, theatrical prose.
In each of us, a person is existing, or waiting to exist; the most free version of that person occasionally appears briefly. · claim · 0.708
Branching alternative · method · 0.702
Bibliographical element: an optional text path that splits from the main line, potential for infinite proliferation.
Persona Vectors (Chen et al.) · framework · 0.700
Prior framework for monitoring and controlling character traits in LLMs via activation directions; this paper extends it to 275 roles.
Persona Stabilization · concept · 0.697
Keeping a model anchored to its intended persona during deployment, preventing drift to harmful behaviors.
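
The selection rule behind the list above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is only an assumed reconstruction: the helper names `cosine` and `untyped_neighbors`, the two-dimensional embeddings, and the `typed_edges` set are hypothetical and do not describe how this graph is actually built.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by score in descending order."""
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and already-linked entities
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Made-up embeddings, chosen only so the arithmetic is easy to follow.
embeddings = {
    "alternative user personas": [1.0, 0.0],
    "steering vectors": [0.9, 0.1],       # similar, but already has a typed edge
    "AI Assistant Persona": [0.8, 0.6],   # similar, no typed edge
    "unrelated entity": [0.0, 1.0],       # below the threshold
}
typed_edges = {"steering vectors"}
candidates = untyped_neighbors("alternative user personas", embeddings, typed_edges)
```

A high score alone does not imply a real relation: entries such as User Participation can clear the threshold on surface wording, so each candidate still needs review before it becomes a typed edge.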