concept
active
concept:introspective-strength

Introspective strength

Spearman's ρ between the logit-based self-report and the probe score, measuring their rank-order agreement; the paper's primary metric of monotonic association
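
As a rough illustration of the metric, the sketch below computes Spearman's ρ as the Pearson correlation of average ranks, which is the standard textbook definition. The function names (`average_ranks`, `pearson`, `spearman_rho`) and the sample values are hypothetical and do not come from the paper; it is a minimal sketch, not the authors' implementation.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def spearman_rho(self_reports, probe_scores):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(self_reports) != len(probe_scores):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(self_reports), average_ranks(probe_scores))


# Hypothetical data: the relation is monotonic but not linear.
reports = [1.2, 3.4, 3.9, 8.7]
probes = [-2.0, -0.5, 0.1, 0.2]
rho = spearman_rho(reports, probes)
```

Because only ranks enter the computation, any strictly increasing relation between self-report and probe score gives ρ = 1 regardless of its shape, which is why the statistic suits a monotonic-association claim.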

Neighborhood — ranked by edge-count

Methods (1)

method
  • Primary self-report measure: probability-weighted expected value over all ten digit-token logits, yielding a continuous rating that preserves the full distributional signal
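
One way to read the measure above is sketched here, under two assumptions that this entry does not confirm: the ten digit tokens are "0" through "9" with face values 0 to 9, and probabilities are renormalized with a softmax over those ten logits only. The function name `expected_rating` is hypothetical.

```python
from math import exp


def expected_rating(digit_logits):
    """Probability-weighted mean digit from ten logits.

    Assumes index d holds the logit of the token for digit d (0..9) and
    that the softmax is taken over these ten tokens only.
    """
    if len(digit_logits) != 10:
        raise ValueError("expected exactly ten digit-token logits")
    # Subtract the maximum logit before exponentiating, for numerical stability.
    top = max(digit_logits)
    weights = [exp(logit - top) for logit in digit_logits]
    total = sum(weights)
    return sum(digit * w for digit, w in enumerate(weights)) / total


# Equal logits spread the mass evenly over 0..9, so the rating is the midpoint.
uniform_rating = expected_rating([0.0] * 10)
```

Compared with taking the single most likely digit, the expectation changes smoothly as probability mass shifts between neighboring digits, which is what makes the rating continuous.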

Concepts (1)

concept
  • Operational definition of introspection: self-report covaries monotonically with probe-defined direction AND causally shifting activations shifts the report in a semantically coherent way

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Introspection · concept · 0.826
    The ability of a model to observe its own past internal states or computations; claimed to be architecturally permitted by transformers.
  • The capacity to detect and report one's own internal states, measured via the five-adjective task and paradox reflection
  • Isotonic R² measuring fraction of variance in self-report explained by probe score under monotonicity assumption; the paper's primary fidelity metric
  • The central concept: the ability of a model to access and report on its internal states, as defined by the paper's criteria.
  • The novel framework introduced in the paper: an HMM-based pain-belief signal integrated into the reward function to drive exploration
  • Identified gap; methods for enabling machine consciousness development through self-examination.
  • The problematic possibility of digital minds with superhumanly strong preferences requiring interpersonal utility comparison frameworks
  • The authors' characterization of genuine but limited introspective capability found only in early-layer injection regimes