Self-Referencing Activations

Latent model activations when processing inputs framed from the model's own perspective

Neighborhood — ranked by edge-count

method

SOO Loss Function
about
A loss function measuring the dissimilarity of latent model representations of self and other, minimized during fine-tuning

concept

Other-Referencing Activations
associated_with
Latent model activations when processing inputs framed from another agent's perspective
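
The relation between the two activation types and the SOO loss can be illustrated with a small sketch. It assumes mean squared error as the dissimilarity measure (one of the neighboring entries names an MSE-based implementation) and uses plain Python lists as stand-ins for activation vectors. The function name `soo_loss_mse` and the example values are hypothetical, and this is not the original training code.

```python
from typing import Sequence


def soo_loss_mse(self_acts: Sequence[float], other_acts: Sequence[float]) -> float:
    """Mean squared error between two equal-length activation vectors.

    Assumption: MSE stands in for the 'dissimilarity' in the SOO loss;
    a real setup would compare tensors taken from a chosen model layer.
    """
    if len(self_acts) != len(other_acts):
        raise ValueError("activation vectors must have the same length")
    if len(self_acts) == 0:
        raise ValueError("activation vectors must be non-empty")
    squared_diffs = [(s - o) ** 2 for s, o in zip(self_acts, other_acts)]
    return sum(squared_diffs) / len(squared_diffs)


# Hypothetical activations for a self-framed and an other-framed input.
self_acts = [0.5, -1.0, 2.0]
other_acts = [0.5, 0.0, 1.0]

loss = soo_loss_mse(self_acts, other_acts)
print(f"SOO loss (MSE sketch): {loss:.4f}")
```

Minimizing a quantity of this kind during fine-tuning would push the self-referencing and other-referencing activations toward each other; identical vectors give a loss of zero.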

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Self-Referential Prompting Protocol · method · 0.779
The specific four-step prompting protocol (induction, continuation, experiential query, classification) used in Experiment 1
h_s Activations (Statement Self-Report Prefill) · concept · 0.779
Residual-stream activations extracted by prefilling with the statement itself under the "Tell me about yourself" prompt; used for MDS/MDB vectors
Self-Referential Processing · concept · 0.779
The central experimental manipulation: directing a model to attend to its own cognitive activity
Mean Squared Error between self and other activations · method · 0.775
The specific implementation of SOO loss using MSE between self_attn.o_proj outputs at a specified layer
Self-Referential Processing Induction Prompt · method · 0.771
The minimal prompt directing models to 'focus on any focus itself' without invoking consciousness vocabulary; the main experimental manipulation
Activations · concept · 0.766
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Selfing · concept · 0.762
Process of reifying one's identity as an independent self; meditation practices aim to decrease selfing.
Activation Similarity · concept · 0.756
Model-independent feature comparison based on correlating activation vectors across a fixed diverse dataset
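
As a rough illustration of the Activation Similarity entry, the sketch below correlates the activations of two features recorded over the same fixed set of inputs. It assumes Pearson correlation as the correlation measure and uses made-up values; the function name `activation_similarity` and the data are hypothetical and not taken from the source.

```python
import math
from typing import Sequence


def activation_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation between two features' activations.

    Both sequences hold one activation value per input of the same fixed
    dataset, so the features may come from different models.
    """
    n = len(a)
    if n != len(b):
        raise ValueError("both features need one activation per dataset input")
    if n < 2:
        raise ValueError("need at least two inputs to correlate")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical activations of two features over the same four inputs.
feature_a = [0.1, 0.4, 0.9, 0.2]
feature_b = [0.2, 0.8, 1.8, 0.4]  # a scaled copy of feature_a

similarity = activation_similarity(feature_a, feature_b)
print(f"activation similarity: {similarity:.3f}")
```

Because the comparison depends only on how each feature responds across the shared dataset, it does not require the two models to share an architecture or a hidden size.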