method
active
method:soo-loss-function
SOO Loss Function
A loss function that measures the dissimilarity between a model's latent representations of self and other; it is minimized during fine-tuning.
Neighborhood — ranked by edge-count
Papers (1)
paper
Frameworks (1)
framework
- The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior
Concepts (3)
concept
- Capability Term in SOO Loss (extends): Additional term in the RL SOO loss that preserves agent capabilities, analogous to the KL term in RLHF
- Latent model activations when processing inputs framed from another agent's perspective
- Latent model activations when processing inputs framed from the model's own perspective
Methods (1)
method
- The specific implementation of SOO loss using MSE between self_attn.o_proj outputs at a specified layer
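The MSE-based formulation described above can be illustrated with a minimal sketch. It assumes the activations have already been captured (for example, from the `self_attn.o_proj` output at the chosen layer) for a self-framed and an other-framed input, and are given as equally shaped matrices. The function name `soo_loss` and the toy values are hypothetical and not taken from the source.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def soo_loss(self_acts: Matrix, other_acts: Matrix) -> float:
    """Mean squared error between two equally shaped activation matrices.

    Hypothetical helper: each row stands in for one token position and each
    column for one hidden dimension of the captured layer output.
    """
    if len(self_acts) != len(other_acts):
        raise ValueError("activation matrices must have the same number of rows")
    total = 0.0
    count = 0
    for row_self, row_other in zip(self_acts, other_acts):
        if len(row_self) != len(row_other):
            raise ValueError("activation rows must have the same length")
        for s, o in zip(row_self, row_other):
            total += (s - o) ** 2
            count += 1
    if count == 0:
        raise ValueError("activation matrices must not be empty")
    return total / count


# Toy stand-ins for activations on a self-framed and an other-framed prompt.
self_activations = [[1.0, 2.0], [3.0, 4.0]]
other_activations = [[1.0, 0.0], [3.0, 2.0]]
loss = soo_loss(self_activations, other_activations)
```

In an actual fine-tuning setup this quantity would be computed on framework tensors so that gradients can flow back into the model; the pure-Python version only shows the arithmetic. Identical self and other activations give a loss of zero, which is the direction minimization pushes toward.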
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- In machine learning, a function measuring the distance between current and desired output; analogous to stress.
- In RL, a scalar signal from the environment that defines the agent's goal; in active inference, reward is just another observation with associated preference.
- Loss function used in both experiments: sum of squared differences between predicted and target grid
- Addressing disparity in loss magnitudes across tasks at the loss level
- The practical, working aspect of a building; reinterpreted as the dynamic harmony of moving centers.
- Extended subjective reward function proposed in this paper combining happiness with pain-belief signal
- Loss computed using continuous relaxations of logic gates during training
- Auxiliary objective combining L2 and cosine losses against pre-recorded CL vectors to improve causal relevance when one model is causally inaccessible.