hypothesis

active

hypothesis:using-assistant-user-tags-as-self-other-referents-could-leverage-generalization-properties-to-induce-larger-scale-changes-in-model-behavior

Using 'assistant'/'user' tags as self/other referents could leverage generalization properties to induce larger-scale changes in model behavior

Future work hypothesis about expanding SOO to use conversational role tags as self/other referents

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Papers (1)

paper

Towards Safe and Honest AI Agents with Neural Self-Other Overlap
introduces

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The role most consistently similar to the default Assistant activation across models is 'generalist'; other shared similar roles include 'interpreter' and 'synthesizer' · claim · 0.800
Characterizes what the Assistant persona resembles in terms of human archetypes
The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training · claim · 0.759
Key mechanistic claim about the developmental origin of the Assistant persona
What exactly is the Assistant? What traits does the model associate with this character and how are they represented? · question · 0.756
First of two central questions motivating the paper
SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences · hypothesis · 0.755
Future work hypothesis about extending SOO to direct value alignment
How reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas? · question · 0.754
Second of two central questions motivating the paper
Steering base models toward the Assistant Axis increases agreeableness traits (friendly, kind, helpful) and decreases extraversion in Gemma and openness in Llama · finding · 0.753
Characterizes the trait content of the Assistant Axis in pre-trained models
SOO fine-tuning's focus on contrastive self and other-referencing observations offers strong potential for generalization across AI architectures · claim · 0.753
Forward-looking claim about architectural generalizability of SOO
Agentic self-evaluation and self-steering may scale to broadly interpret and understand internal representations and SAE features · claim · 0.750
Forward-looking claim about the potential of model introspection as an interpretability tool
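
The "Related by similarity" listing above can be read as a cosine-threshold filter over entity embeddings. The following is only a minimal sketch of that idea, not this site's actual pipeline: the function names, the toy vectors, and the representation of typed edges as a set of names are all assumptions made for illustration. Only the 0.65 cutoff and the "no typed edge" condition come from the page.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def related_by_similarity(target, candidates, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold, highest first.

    target      -- embedding of the entity being viewed (hypothetical)
    candidates  -- dict mapping entity name -> embedding (hypothetical)
    typed_edges -- set of names that already have a typed relation to the
                   target; these are excluded, mirroring "no typed edge"
    """
    scored = [
        (name, cosine(target, vec))
        for name, vec in candidates.items()
        if name not in typed_edges
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy usage with made-up 2-d embeddings.
neighbors = related_by_similarity(
    target=[1.0, 0.0],
    candidates={
        "a": [1.0, 0.0],  # identical direction
        "b": [0.0, 1.0],  # orthogonal, falls below the cutoff
        "c": [1.0, 1.0],  # 45 degrees away
        "d": [1.0, 0.0],  # similar, but already linked by a typed edge
    },
    typed_edges={"d"},
)
```

Entities returned by such a filter are similar to the target but not yet linked to it, which is why the page frames them as candidates for new edges or unrecognized duplicates.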