h_s Activations (Statement Self-Report Prefill)

Residual-stream activations extracted by prefilling with the statement itself under a "Tell me about yourself" prompt; used for MDS/MDB vectors.

Neighborhood — ranked by edge-count

method

MDS Injection
uses
Mean-difference vectors derived from self-statement activations (h_s); best-performing injection method in open-ended generation
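
A minimal sketch of how a mean-difference vector could be built from h_s activations and added to a hidden state. The split into two contrast sets, the names `mean_difference` and `inject`, the scale `alpha`, and the toy 3-dimensional vectors are all assumptions for illustration; the source does not specify which sets are contrasted, at which layer, or how the vector is scaled.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def mean_difference(pos, neg):
    """Difference between the mean activation of two contrast sets."""
    mu_pos = mean_vector(pos)
    mu_neg = mean_vector(neg)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def inject(hidden, direction, alpha=1.0):
    """Add a scaled steering direction to one hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy stand-ins for h_s activations (hypothetical values, not real data).
h_s_pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 2.0]]
h_s_neg = [[0.0, 1.0, 1.0], [2.0, 1.0, 1.0]]

mds = mean_difference(h_s_pos, h_s_neg)
steered = inject([0.5, 0.5, 0.5], mds, alpha=2.0)
```

With real activations the same arithmetic would normally run on tensors inside a forward hook; plain lists are used here only to keep the sketch dependency-free.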

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

h_b Activations (Yes/No Binary Prefill) · concept · 0.795
Residual-stream activations extracted by prefilling with a Yes/No response to an identity statement; achieves perfect probe separability
Self-Referencing Activations · concept · 0.779
Latent model activations when processing inputs framed from the model's own perspective
h_b activations encode Yes/No semantics that produce clean separability for logistic probes, unlike h_s full-utterance activations · claim · 0.725
Explanation for why probes on h_b achieve perfect accuracy but h_s probes succeed only 0.60% of the time
Activations · concept · 0.720
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
The prefill detection task may involve concordance heads that measure the likelihood of the output given prior activations · claim · 0.710
Speculation that QK circuit 'concordance heads' underlie the ability to distinguish intended from unintended outputs.
Self-report · concept · 0.706
The model's verbal description of its internal state, which may be accurate or confabulated.
Mean Squared Error between self and other activations · method · 0.705
The specific implementation of the SOO loss: MSE between self_attn.o_proj outputs for self and other inputs at a specified layer
Prefill detection task · method · 0.694
Task where a random word is prefilled as the assistant's response, then the model is asked whether it intended to say that word, testing introspection on prior intentions.
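
The "Mean Squared Error between self and other activations" entry above can be illustrated with a short sketch. It assumes the two activation sets have already been captured and flattened into equal-length lists of floats; the function name `soo_mse` and the toy values are hypothetical, and the capture of self_attn.o_proj outputs at the chosen layer is not shown.

```python
def soo_mse(self_acts, other_acts):
    """Mean squared error between two equal-length activation vectors."""
    if len(self_acts) != len(other_acts):
        raise ValueError("activation vectors must have the same length")
    if not self_acts:
        raise ValueError("activation vectors must be non-empty")
    squared = [(s - o) ** 2 for s, o in zip(self_acts, other_acts)]
    return sum(squared) / len(squared)


# Toy stand-ins for flattened o_proj outputs (hypothetical values).
self_acts = [1.0, 2.0, 3.0]
other_acts = [1.0, 0.0, 5.0]
loss = soo_mse(self_acts, other_acts)
```

A loss of zero means the two activation vectors are identical; in a training setup this scalar would be computed on tensors so that gradients can flow back through it.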