Other-Referencing Activations

Latent model activations when processing inputs framed from another agent's perspective

Neighborhood — ranked by edge-count

method

SOO Loss Function
about
A loss function measuring the dissimilarity of latent model representations of self and other, minimized during fine-tuning

concept

Self-Referencing Activations
associated_with
Latent model activations when processing inputs framed from the model's own perspective
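A minimal sketch of how the SOO Loss Function entry relates the two activation types, assuming mean squared error as the dissimilarity measure. The entry says only "dissimilarity", so the metric, the function name soo_dissimilarity, and the example vectors are all illustrative assumptions, not the method's actual definition.

```python
from typing import Sequence


def soo_dissimilarity(self_acts: Sequence[float], other_acts: Sequence[float]) -> float:
    """Mean squared difference between two activation vectors.

    Assumption: MSE stands in for the unspecified dissimilarity measure.
    """
    if len(self_acts) != len(other_acts):
        raise ValueError("activation vectors must have the same length")
    if not self_acts:
        raise ValueError("activation vectors must be non-empty")
    # Average the squared element-wise differences.
    return sum((s - o) ** 2 for s, o in zip(self_acts, other_acts)) / len(self_acts)


# Hypothetical activations for a self-framed and an other-framed prompt.
self_acts = [0.5, -1.0, 2.0]
other_acts = [0.5, 0.0, 1.0]
loss = soo_dissimilarity(self_acts, other_acts)
```

Under this reading, fine-tuning would lower the value by pulling the other-referencing activations toward the self-referencing ones; identical vectors give a loss of zero.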

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activations · concept · 0.818
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Activation Similarity · concept · 0.798
Model-independent feature comparison based on correlating activation vectors across a fixed diverse dataset
Activation Addition · method · 0.787
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
Activation Correlation · method · 0.773
Pearson correlation of feature activations across 40M tokens used to measure feature similarity and universality across models
Activation Capping · method · 0.755
Clamping activations along the Assistant Axis to remain above a minimum threshold (25th percentile), introduced as a stabilization method
Reflection Enhancement via Activation Addition · method · 0.751
Adding a steering vector in the forward direction to push model activations toward stronger reflective behavior.
cross-reference · concept · 0.751
Explicit textual or graphical links between parts of a work, dynamic and virtual.
Activation Oracles · framework · 0.748
Framework training LLMs to answer questions about externally-provided activation vectors