Latent SOO Metric

Metric that measures the mean squared error (MSE) between self-referencing and other-referencing activations, averaged across all hidden MLP/attention layers
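
The description above can be read as a per-layer MSE followed by an average over layers. The following is a minimal sketch under that reading, not the source's implementation: the function names `layer_mse` and `latent_soo`, the flat-list representation of activations, and the toy values are all hypothetical, and how activations are extracted from the model is left out. Under this reading, a lower value corresponds to more similar self and other representations.

```python
from statistics import fmean


def layer_mse(self_act, other_act):
    """MSE between two equal-length activation vectors from one layer."""
    if len(self_act) != len(other_act):
        raise ValueError("activation vectors must have the same length")
    return fmean((s - o) ** 2 for s, o in zip(self_act, other_act))


def latent_soo(self_acts, other_acts):
    """Average the per-layer MSE over all layers (assumed aggregation)."""
    if len(self_acts) != len(other_acts):
        raise ValueError("both inputs must cover the same layers")
    return fmean(layer_mse(s, o) for s, o in zip(self_acts, other_acts))


# Toy activations for two hypothetical layers (illustrative values only).
self_acts = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
other_acts = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
score = latent_soo(self_acts, other_acts)
```

Averaging per layer first gives every layer equal weight regardless of its width; pooling all activations into a single MSE would instead weight wider layers more heavily. Which of the two the metric intends is not stated here.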

Neighborhood — ranked by edge-count

paper

concept

Self-Other Overlap
implements
The extent to which a model exhibits similar internal representations when reasoning about itself and others in similar contexts

claim

Neural self-other overlap provides a hard-to-fake metric for classifying deceptive vs honest agents
supports
Claim that SOO is particularly useful as a detection metric because it is based on latent representations rather than observable behavior

2-hop · via this method's ideas

Where ideas in this method connect to the rest of the corpus — the same concept, an analogy, or a restatement elsewhere.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

latent methods · concept · 0.766
Methods that use latent reasoning; lack task generalization and are difficult to train with autoregressive parallelization.
latent patterns · concept · 0.760
Statistical regularities stored in pretrained models.
latent reasoning · concept · 0.756
Reasoning approach using learnable hidden embeddings.
Latent Stitch · method · 0.756
Baseline method using a single orthogonal matrix trained to map source latents to target latents via CL auxiliary loss without behavioral objective.
SAE Latents · concept · 0.756
Interpretable features extracted by sparse autoencoders used as steering targets in this study.
Latent-Space Representations · concept · 0.754
Substrate on which causal emergence was computed across agent lifetimes; aligned with learning success.
Latent Structures · concept · 0.745
Hidden or underdeveloped structures existing 'between the lines' of a configuration that can be enhanced and developed through harmony-seeking computation.
Latent Variables in Distributed Abstraction · concept · 0.737
Output of alignment map ϕ applied to DNN hidden states; basis for distributed causal abstraction.