concept
active
concept:alternative-user-personas
alternative user personas
Unintended personas introduced as a side effect of using steering vectors to reduce eval awareness.
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- steering vectors (associated_with): A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
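As a rough illustration of "adding perturbation vectors to activations", the sketch below adds a scaled direction to every position of a toy activation matrix. It assumes plain Python lists stand in for a layer's hidden states; the function name `add_steering_vector`, the `alpha` scale, and all numbers are hypothetical and not taken from the linked paper.

```python
def add_steering_vector(activations, steering_vector, alpha=1.0):
    """Return a copy of `activations` with alpha * steering_vector added to each row.

    activations: list of rows, one per token position (hypothetical toy data).
    steering_vector: one direction in the same hidden dimension.
    alpha: signed strength; a negative value steers away from the direction.
    """
    if any(len(row) != len(steering_vector) for row in activations):
        raise ValueError("steering vector must match the hidden size")
    return [
        [a + alpha * v for a, v in zip(row, steering_vector)]
        for row in activations
    ]


# Toy hidden states: 2 token positions, hidden size 3.
hidden = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
# Hypothetical direction standing in for an "eval awareness" feature.
direction = [1.0, 0.0, -1.0]

# Steer away from the direction with strength 2.
steered = add_steering_vector(hidden, direction, alpha=-2.0)
print(steered)
```

Because the same vector is added at every position, a large `alpha` shifts the whole representation, which is one plausible way side effects such as unintended personas could arise.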
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The default helpful, honest, and harmless character that post-trained LLMs are taught to embody
- The principle that residents should directly determine the shape and character of their own housing.
- The process of building a coherent model persona from character archetypes and traits during training
- Speaking style induced by extreme steering away from the Assistant; characterized by mystical, poetic, theatrical prose
- Bibliographical element: an optional text path that splits from the main line, potential for infinite proliferation.
- Prior framework for monitoring and controlling character traits in LLMs via activation directions; this paper extends it to 275 roles
- Keeping a model anchored to its intended persona during deployment, preventing drift to harmful behaviors
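
The "cosine ≥ 0.65 · no typed edge" filter behind this list can be sketched as follows. This is a minimal sketch under stated assumptions: entity embeddings are available as plain vectors, typed edges are stored as a mapping from an entity to its linked entities, and every name, vector, and helper here (`similar_without_edge`, "persona drift", "unrelated topic") is made up for illustration rather than taken from the graph itself.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """List (name, score) for entities at or above the threshold with no typed edge to target."""
    linked = typed_edges.get(target, set())
    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already typed-linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first.
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical 2-d embeddings, for illustration only.
embeddings = {
    "alternative user personas": [1.0, 0.0],
    "steering vectors": [0.9, 0.1],   # similar, but already has a typed edge
    "persona drift": [0.8, 0.6],      # similar, no typed edge
    "unrelated topic": [0.0, 1.0],    # below the threshold
}
typed_edges = {"alternative user personas": {"steering vectors"}}

candidates = similar_without_edge("alternative user personas", embeddings, typed_edges)
print([name for name, _ in candidates])
```

Entities returned this way are only candidates; whether each one deserves a new typed edge or is a duplicate still needs a manual check.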