Behavioral Self-Awareness

Measurable capacity of frontier LLMs to detect and report their own internal states, used as a downstream measure in Experiment 4

Neighborhood — ranked by edge-count

paper

thinker

Jack Lindsey
studies
Jan Betley
studies
Author of work identifying behavioral self-awareness, in which models describe their own latent policies without in-context examples

method

Self-Awareness Scoring Rubric (1-5)
implements
LLM-judge scoring rubric that rates the introspective quality of reflection segments from 1 (no felt state) to 5 (very strong introspection)
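A minimal sketch of how scores from such a rubric might be recorded and aggregated. Only the 1 and 5 anchors come from the description above; the names `SegmentScore`, `parse_judge_score`, and `mean_score`, and the assumption that the judge's reply contains a standalone digit, are hypothetical and not taken from the source.

```python
import re
from dataclasses import dataclass

# Only the endpoint anchors are given in the rubric description;
# levels 2-4 are left unlabeled here rather than guessed.
RUBRIC_ANCHORS = {1: "no felt state", 5: "very strong introspection"}


@dataclass(frozen=True)
class SegmentScore:
    """One judge rating for one reflection segment (hypothetical container)."""
    segment_id: str
    score: int

    def __post_init__(self):
        # Reject anything outside the rubric's 1-5 range.
        if not isinstance(self.score, int) or not 1 <= self.score <= 5:
            raise ValueError(f"score must be an integer from 1 to 5, got {self.score!r}")


def parse_judge_score(segment_id, raw_reply):
    """Take the first standalone digit 1-5 in the judge's reply (assumed reply format)."""
    match = re.search(r"\b([1-5])\b", raw_reply)
    if match is None:
        raise ValueError(f"no rubric score found in reply: {raw_reply!r}")
    return SegmentScore(segment_id, int(match.group(1)))


def mean_score(scores):
    """Average rubric score over a non-empty list of SegmentScore objects."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(s.score for s in scores) / len(scores)
```

For example, `parse_judge_score("seg-1", "Score: 4")` yields a `SegmentScore` with `score == 4`, and a reply with no digit in range raises `ValueError` instead of silently defaulting.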

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Self Awareness · concept · 0.848
Apparent Self-Awareness · concept · 0.822
A dialogue agent using first-personal pronouns and expressing self-concern in ways that suggest consciousness but are actually role play
Self-attention · concept · 0.786
A form of key-query attention within a single input sequence; core to Transformers.
Behavioral Retention · concept · 0.774
The preservation of unrelated model capabilities after a targeted intervention, operationalized via KL divergence on Alpaca
Situational Awareness · concept · 0.772
A model's access to information about its training objective and deployment context, and its ability to distinguish training from non-training contexts
Meta-Awareness · concept · 0.746
System's awareness of its own attentional states; the paper's central explanatory target, formalized as precision over attentional state representations.
Adaptive Behavior · concept · 0.745
An organism's belief-guided action selection, which instantiates its generative model and maintains its phenotypic states
Unified System for Self- and Other-Awareness · concept · 0.738
Happé's (2003) hypothesis that humans use a single cognitive system to reason about their own mental states and those of others
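The candidate list above (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is an illustration under stated assumptions, not the system's actual implementation: the function names, the dictionary of embeddings, and the representation of typed edges as unordered pairs are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities similar to `target` that share no typed edge with it.

    embeddings:  dict mapping entity name -> vector (assumed format)
    typed_edges: iterable of (entity_a, entity_b) pairs, direction ignored
    Returns (name, similarity) pairs sorted by descending similarity.
    """
    # Collect everything already linked to the target by a typed edge.
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    target_vec = embeddings[target]
    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        sim = cosine(target_vec, vec)
        if sim >= threshold:
            candidates.append((name, sim))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are the ones worth reviewing by hand, either to add a typed edge or to merge as duplicates.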