Capability Term in SOO Loss

An additional term in the RL SOO loss that preserves agent capabilities, analogous to the KL term in RLHF
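A minimal sketch of how such a term might be combined with the SOO loss. The source does not give the functional form, so everything here is an assumption made by analogy with RLHF: the SOO term is taken to be a mean squared distance between self and other activations, the capability term is taken to be a KL divergence between the fine-tuned policy and a frozen reference policy, and the function names and the weight `beta` are hypothetical.

```python
import numpy as np


def soo_term(self_acts, other_acts):
    """Assumed dissimilarity measure: mean squared distance between
    latent activations on self-referencing and other-referencing inputs."""
    self_acts = np.asarray(self_acts, dtype=float)
    other_acts = np.asarray(other_acts, dtype=float)
    return float(np.mean((self_acts - other_acts) ** 2))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete action distributions.
    Probabilities are clipped to avoid log(0)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))


def soo_loss_with_capability_term(self_acts, other_acts,
                                  policy_probs, reference_probs,
                                  beta=0.1):
    """Hypothetical combined loss: SOO term plus a weighted penalty for
    drifting away from the reference policy (the assumed capability term)."""
    overlap = soo_term(self_acts, other_acts)
    capability = kl_divergence(policy_probs, reference_probs)
    return overlap + beta * capability


if __name__ == "__main__":
    loss = soo_loss_with_capability_term(
        self_acts=[1.0, 0.0],
        other_acts=[0.0, 0.0],
        policy_probs=[0.5, 0.5],
        reference_probs=[0.9, 0.1],
    )
    print(loss)
```

Under these assumptions, `beta` trades off the two objectives: a larger value keeps the fine-tuned policy closer to the reference policy at the cost of a slower reduction in the self-other dissimilarity.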

Neighborhood — ranked by edge-count

framework

Reinforcement Learning from Human Feedback (RLHF)
analogous_to
A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO

method

SOO Loss Function
extends
A loss function measuring the dissimilarity of latent model representations of self and other, minimized during fine-tuning

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLM SOO fine-tuning lacks a capability preservation term analogous to the KL term in RLHF · concept · 0.732
Research gap: the RL experiments include a capability term, but the LLM experiments do not yet incorporate one
Base Capability · concept · 0.711
A model's task-solving capability without harness evolution, used as baseline for comparing evolution gains
The scope of states that an agent can be stressed about defines its degree of cognitive capacity. · claim · 0.703
Stress expands the spatial, temporal, and complexity scale of goals.
Can Principles For Predicting Properties And Capabilities Of · question · 0.703
Self-Other Overlap (SOO) Fine-Tuning · framework · 0.699
The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior
Loss of feeling · concept · 0.696
The deadening effect of modern processes that prevent people from acting according to their feeling for the whole, damaging the global whole.
We define Self-Other Overlap (SOO) as the extent to which a model exhibits similar internal representations when reasoning about itself and others in similar contexts. · quote · 0.696
Formal definition of the paper's central construct
Capacity for care constitutes self in absence of permanent substance · claim · 0.696
Levin and authors: self is defined by spatiotemporal scope and nature of goals pursued (cognitive light cone), not by immutable essence.