framework
active
framework:quantitative-introspection-framework

Quantitative Introspection Framework

The paper's central contribution: treating LLM numeric self-report as a quantitative signal, validated against probe-defined internal states and causally confirmed via steering.

Neighborhood — ranked by edge-count

Methods (6)

method
  • Causal intervention technique: edit the NLA explanation, reconstruct it via AR, and use the difference as a steering vector to manipulate model behavior.
  • Probe construction method: concept vector at each layer is L2-normalized difference between mean positive and mean negative representations from contrastive system prompts
  • Primary self-report measure: probability-weighted expected value over all ten digit-token logits, yielding a continuous rating that preserves full distributional signal
  • False discovery rate correction: applied within concept/endpoint families to control the false discovery rate across parallel tests
  • Bootstrap resampling at conversation level (B=1000, 95% percentile CIs) to respect non-independence of within-conversation observations
  • Primary statistical model: mixed-effects model with a random intercept per conversation, fitted by REML, for pooled conversation-turn observations
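
A minimal sketch of the probe construction and the primary self-report measure listed above, not the paper's implementation. It assumes layer activations are already available as NumPy arrays, that the ten digit tokens stand for the ratings 0 through 9, and that probabilities are renormalized over those ten logits only. The function names (`concept_vector`, `probe_score`, `expected_rating`) are hypothetical, and scoring a state by projection onto the concept vector is likewise an assumption.

```python
import numpy as np


def concept_vector(pos, neg):
    """L2-normalized difference of mean activations at one layer.

    pos, neg: arrays of shape (n_examples, d_model) collected under the
    positive and negative contrastive system prompts (assumed inputs).
    """
    diff = np.asarray(pos, dtype=float).mean(axis=0) - np.asarray(neg, dtype=float).mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("positive and negative means coincide; no direction to normalize")
    return diff / norm


def probe_score(hidden, vector):
    """Assumed scoring rule: projection of one hidden state onto the concept vector."""
    return float(np.asarray(hidden, dtype=float) @ vector)


def expected_rating(digit_logits, digit_values=None):
    """Probability-weighted expected rating over the ten digit-token logits.

    Assumption: softmax is taken over the ten digit logits only, and the
    digits are valued 0..9 unless digit_values says otherwise.
    """
    logits = np.asarray(digit_logits, dtype=float)
    if digit_values is None:
        digit_values = np.arange(logits.shape[0], dtype=float)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return float(probs @ np.asarray(digit_values, dtype=float))
```

With uniform logits, `expected_rating(np.zeros(10))` returns 4.5, the midpoint of the assumed 0 to 9 scale; a sharply peaked distribution returns a value close to the peak digit. Unlike taking the argmax digit, the expectation keeps the full distributional signal in a single continuous number.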

Concepts (1)

concept
  • Operational definition of introspection: self-report covaries monotonically with probe-defined direction AND causally shifting activations shifts the report in a semantically coherent way

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Introspection · concept · 0.799
    The ability of a model to observe its own past internal states or computations; claimed to be architecturally permitted by transformers.
  • Tracking of functional/computational cognitive states, distinguished from phenomenal introspection.
  • Identified gap; methods for enabling machine consciousness development through self-examination.
  • Direct introspection into phenomenal consciousness; its correlation with functional introspection is an open question.
  • AI Introspection · concept · 0.777
    Key gap identified in the literature; systematic self-examination processes for machine consciousness development.
  • Prior framework claiming frontier LLMs can detect and name injected concepts, interpreted as nascent self-awareness
  • The capacity of a model to self-report on its internal emotional state when its SAE features are steered, used here as a measurement tool
  • Spearman ρ measuring rank-order agreement between logit-based self-report and probe score; the paper's primary monotonic association metric