concept
active
concept:h-s-activations-statement-self-report-prefill
h_s Activations (Statement Self-Report Prefill)
Residual-stream activations extracted by prefilling the response with the statement itself under a "Tell me about yourself" prompt; used to derive MDS/MDB vectors.
Neighborhood — ranked by edge-count
Methods (1)
method
- MDS Injection (uses): Mean-difference vectors derived from self-statement activations (h_s); best-performing injection method in open-ended generation
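As a rough illustration of what a mean-difference vector over h_s-style activations could look like, here is a minimal NumPy sketch. The entry does not say which statement groups are contrasted, which layer is read, or how the vector is scaled at injection time. The two activation groups, the `alpha` scale, and both function names are therefore hypothetical, and the data is random placeholder data rather than real model activations.

```python
import numpy as np

def mean_difference_vector(pos_acts, neg_acts):
    """Difference of per-group mean activations (hypothetical helper).

    pos_acts, neg_acts: arrays of shape (n_examples, d_model), one row per
    statement, e.g. activations collected under two contrasting statement sets.
    """
    pos = np.asarray(pos_acts, dtype=float)
    neg = np.asarray(neg_acts, dtype=float)
    return pos.mean(axis=0) - neg.mean(axis=0)

def inject(hidden, direction, alpha=1.0):
    """Add a scaled direction to every position's hidden state (assumed form)."""
    return np.asarray(hidden, dtype=float) + alpha * direction

# Placeholder activations standing in for h_s; not real model outputs.
rng = np.random.default_rng(0)
d_model = 8
pos = rng.normal(size=(16, d_model)) + 1.0
neg = rng.normal(size=(16, d_model))

v = mean_difference_vector(pos, neg)
steered = inject(np.zeros((3, d_model)), v, alpha=2.0)
```

In a real setup the rows would come from a hook on the residual stream at a chosen layer, and `alpha` would need tuning; neither is specified here.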
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Residual-stream activations extracted by prefilling with Yes/No response to identity statement; achieves perfect probe separability
- Latent model activations when processing inputs framed from the model's own perspective
- Explanation for why probes on h_b achieve perfect accuracy but h_s probes succeed only 0.60% of the time
- Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
- Speculation that QK circuit 'concordance heads' underlie the ability to distinguish intended from unintended outputs.
- The model's verbal description of its internal state, which may be accurate or confabulated.
- The specific implementation of SOO loss using MSE between self_attn.o_proj outputs at a specified layer
- Task where a random word is prefilled as the assistant's response, then the model is asked whether it intended to say that word, testing introspection on prior intentions.