Privileged self-access

Models predict their own hypothetical behavior better than other models can predict it, demonstrating a form of privileged self-access (Binder et al., 2024).

Neighborhood — ranked by edge-count

paper

thinker

Felix J. Binder
introduces
Demonstrated that models predict their own behavior better than other models do (privileged self-access) and studied introspection in constrained settings.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Privileged Basis · concept · 0.810
A property of activations where neural network features align with basis dimensions due to sparse activation functions; absent in the residual stream but present in tokens, attention patterns, and MLP activations
Vulnerable Self · concept · 0.780
The childlike, genuine human part of oneself needed to create true life.
Introspective Access · concept · 0.760
The capacity to detect and report one's own internal states, measured via the five-adjective task and paradox reflection
Access consciousness · concept · 0.737
Information available for reasoning, report, and decision-making.
Apparent Self-Awareness · concept · 0.735
A dialogue agent using first-personal pronouns and expressing self-concern in ways that suggest consciousness but are actually role play
Privileged Bases Hypothesis · concept · 0.731
Hypothesis that neurons form privileged bases to encode information; consistent with constructive abstraction
Self-Other Distinction · concept · 0.726
The implicit capacity the self-prior implements by assigning high density to familiar self-states and low density to non-self states
Self-attention · concept · 0.726
A form of key-query attention within a single input sequence; core to Transformers.
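
A minimal sketch of how a candidate list like the one above could be produced: keep entities whose embedding has cosine similarity of at least 0.65 with the target and which share no typed edge with it, then rank by score. The function names, the dictionary-based embedding store, and the edge representation are hypothetical; this is not the site's actual implementation, and the toy vectors are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities similar to `target` but not linked to it.

    embeddings:  dict mapping entity name -> embedding vector (assumed layout)
    typed_edges: iterable of (source, destination) name pairs (assumed layout)
    """
    # Entities that already have a typed relation to the target, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            candidates.append((name, score))

    # Highest similarity first, matching the ordering of the list above.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data: "D" is similar to "A" but already linked, "C" is orthogonal to "A".
toy_embeddings = {"A": [1.0, 0.0], "B": [1.0, 0.5], "C": [0.0, 1.0], "D": [1.0, 0.1]}
toy_edges = {("D", "A")}
toy_result = untyped_neighbors("A", toy_embeddings, toy_edges)
```

With the toy data, only "B" survives both filters: "D" is excluded by its existing edge and "C" falls below the threshold.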