concept
active
concept:recognition-of-self-generated-outputs
Recognition of self-generated outputs
Ability to distinguish one's own outputs from those of other models or humans; related to prefill detection.
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Apps and Tsakiris's proposal that self-recognition can be modeled probabilistically under the free energy principle
- Framework by Lee et al. explaining self-correction via linear latent concept directions, closely related prior work.
- Latent model activations when processing inputs framed from the model's own perspective
- The central experimental manipulation: directing a model to attend to its own cognitive activity
- Identification of algorithms implemented in attention layers, distributed across attention heads · finding · 0.743 · VPD successfully recovered interpretable attention algorithms (previous-token behavior, syntax-boundary routing) in weight space without requiring manual decomposition across heads.
- Models can distinguish artificially prefilled outputs from intentional responses by referencing prior internal representations; injecting a matching concept vector causes the model to retroactively accept the prefill as intentional.
- Temperature=0.8 sampled decoding for self-report; reduces collapse moderately but remains discrete and noisy
- The model's verbal description of its internal state, which may be accurate or confabulated.
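
The "cosine ≥ 0.65 · no typed edge" filter that produces this list can be sketched as follows. This is only an illustration under stated assumptions: the entity names, the 3-dimensional embeddings, and the `untyped_neighbors` helper are all hypothetical, and the page does not say how its embeddings or edges are actually stored.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs similar to `target` that have no typed edge to it.

    `typed_edges` is treated as a set of undirected (a, b) pairs; this is an
    assumption made for the sketch, not a description of the real graph schema.
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already connected to the target by a typed relation.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    # Highest similarity first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Made-up toy embeddings, purely for illustration.
embeddings = {
    "self-recognition": [1.0, 0.0, 0.0],
    "prefill-detection": [0.9, 0.1, 0.0],
    "temperature-sampling": [0.0, 1.0, 0.0],
    "source-paper": [1.0, 0.0, 0.0],
}
typed_edges = {("self-recognition", "source-paper")}

neighbors = untyped_neighbors("self-recognition", embeddings, typed_edges)
print(neighbors)
```

In this toy setup, `source-paper` is excluded because it already has a typed edge, and `temperature-sampling` falls below the threshold, so only `prefill-detection` is reported as a candidate for a new edge or a possible duplicate.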