SOO Loss Function

A loss function measuring the dissimilarity of latent model representations of self and other, minimized during fine-tuning

Neighborhood — ranked by edge-count

paper

framework

Self-Other Overlap (SOO) Fine-Tuning
uses
The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior

concept

Capability Term in SOO Loss
extends
An additional term in the RL SOO loss that preserves agent capabilities, analogous to the KL term in RLHF
Other-Referencing Activations
about
Latent model activations when processing inputs framed from another agent's perspective
Self-Referencing Activations
about
Latent model activations when processing inputs framed from the model's own perspective
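
The capability term described above is only characterized here as analogous to the KL term in RLHF; its exact form is not given. As a minimal sketch of that analogy, assuming a discrete action distribution and a hypothetical weighting coefficient `beta` (neither appears in the source), the combined objective might look like this:

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same outcomes.

    Assumes q is non-zero wherever p is non-zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                raise ValueError("q must be positive wherever p is positive")
            total += pi * math.log(pi / qi)
    return total


def combined_loss(
    soo_loss: float,
    policy: Sequence[float],
    reference: Sequence[float],
    beta: float = 0.1,  # hypothetical weight, not taken from the source
) -> float:
    """SOO loss plus a KL-style penalty that keeps the fine-tuned policy
    close to a reference policy (an illustrative stand-in for the
    capability term, not the paper's formula)."""
    return soo_loss + beta * kl_divergence(policy, reference)


# Identical policies incur no penalty; drifting from the reference adds one.
unchanged = combined_loss(0.5, [0.5, 0.5], [0.5, 0.5])
drifted = combined_loss(0.5, [0.9, 0.1], [0.5, 0.5])
```

The penalty is zero when the fine-tuned policy matches the reference and grows as it drifts, which is the sense in which such a term can protect capabilities while the SOO loss is minimized.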

method

Mean Squared Error between self and other activations
uses
The specific implementation of SOO loss using MSE between self_attn.o_proj outputs at a specified layer
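
A minimal sketch of the MSE computation, in plain Python so that it runs without a deep-learning framework. It assumes the self-referencing and other-referencing activations have already been captured at the chosen layer's `self_attn.o_proj` output and flattened to equal-length lists; the function name and the sample values are hypothetical.

```python
from typing import Sequence


def soo_mse_loss(self_acts: Sequence[float], other_acts: Sequence[float]) -> float:
    """Mean squared error between self- and other-referencing activations."""
    if len(self_acts) != len(other_acts):
        raise ValueError("activation vectors must have the same length")
    if len(self_acts) == 0:
        raise ValueError("activation vectors must be non-empty")
    squared = [(s - o) ** 2 for s, o in zip(self_acts, other_acts)]
    return sum(squared) / len(squared)


# Hypothetical flattened activations for one self/other prompt pair.
self_acts = [0.5, -1.0, 2.0, 0.0]
other_acts = [0.0, -1.0, 1.0, 1.0]

loss = soo_mse_loss(self_acts, other_acts)
print(loss)  # 0.5625
```

The loss is zero when the two sets of activations coincide, so minimizing it during fine-tuning pulls the self and other representations toward each other.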

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Loss Function · concept · 0.822
In machine learning, a function measuring the distance between current and desired output; analogous to stress.
Reward Function · concept · 0.688
In RL, a scalar signal from the environment that defines the agent's goal; in active inference, reward is just another observation with associated preference.
Squared Difference Loss · method · 0.684
Loss function used in both experiments: sum of squared differences between predicted and target grid
loss-scale balancing · concept · 0.682
Addressing disparity in loss magnitudes across tasks at the loss level
Function · concept · 0.680
The practical, working aspect of a building; reinterpreted as the dynamic harmony of moving centers.
Well-Being Function f[w] · method · 0.670
Extended subjective reward function proposed in this paper combining happiness with pain-belief signal
Soft Loss · concept · 0.663
Loss computed using continuous relaxations of logic gates during training
Counterfactual Latent (CL) Auxiliary Loss · method · 0.655
Auxiliary objective combining L2 and cosine losses against pre-recorded CL vectors to improve causal relevance when one model is causally inaccessible.