Recognition of self-generated outputs

Ability to distinguish one's own outputs from those of other models or humans; related to prefill detection.

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Predictive Coding Account of Self-Recognition · concept · 0.755
Apps and Tsakiris's proposal that self-recognition can be modeled probabilistically under the free energy principle.
Intrinsic Self-Correction via Linear Representations · framework · 0.754
Framework by Lee et al. explaining self-correction via linear latent concept directions; closely related prior work.
Self-Referencing Activations · concept · 0.751
Latent model activations when processing inputs framed from the model's own perspective.
Self-Referential Processing · concept · 0.746
The central experimental manipulation: directing a model to attend to its own cognitive activity.
Identification of algorithms implemented in attention layers, distributed across attention heads · finding · 0.743
VPD successfully recovered interpretable attention algorithms (previous-token behavior, syntax-boundary routing) in weight space without requiring manual decomposition across heads.
Detecting Unintended Outputs via Introspection · finding · 0.739
Models can distinguish artificially prefilled outputs from intentional responses by referencing prior internal representations; injecting a matching concept vector causes the model to retroactively accept the prefill as intentional.
Sampled-decoding self-report · method · 0.738
Temperature=0.8 sampled decoding for self-report; reduces collapse moderately but remains discrete and noisy.
Self-report · concept · 0.738
The model's verbal description of its internal state, which may be accurate or confabulated.
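
The selection rule stated for this neighborhood (cosine ≥ 0.65 and no typed edge) can be sketched as follows. This is only an illustration under assumptions: the function names, the dictionary of embeddings, the representation of typed edges as a set of name pairs, and the descending sort by score are all hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities similar to `target` that have
    no typed edge to it. All inputs are hypothetical structures:
      embeddings:  dict mapping entity name -> list of floats
      typed_edges: set of (name_a, name_b) pairs, treated as undirected
    """
    # Entities already connected to the target by a typed edge are excluded.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))

    # Assumption: present the most similar candidates first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Each returned pair is a candidate for review as either a new typed edge or an unrecognized duplicate; the threshold only filters candidates and does not decide which of the two applies.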