hypothesis
active
hypothesis:using-assistant-user-tags-as-self-other-referents-could-leverage-generalization-properties-to-induce-larger-scale-changes-in-model-behavior
Using 'assistant'/'user' tags as self/other referents could leverage generalization properties to induce larger-scale changes in model behavior
Future work hypothesis about expanding SOO to use conversational role tags as self/other referents
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Characterizes what the Assistant persona resembles in terms of human archetypes
- Key mechanistic claim about the developmental origin of the Assistant persona
- What exactly is the Assistant? What traits does the model associate with this character and how are they represented? · question · 0.756 · First of two central questions motivating the paper
- Future work hypothesis about extending SOO to direct value alignment
- Second of two central questions motivating the paper
- Characterizes the trait content of the Assistant Axis in pre-trained models
- Forward-looking claim about architectural generalizability of SOO
- Forward-looking claim about the potential of model introspection as an interpretability tool