Quantitative Introspection Framework

The paper's central contribution: treating LLM numeric self-report as a quantitative signal, validated against probe-defined internal states and causally confirmed via steering.

Neighborhood — ranked by edge-count

Papers (1)

paper

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
introduces

Methods (6)

method

Activation Steering
uses
Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
Contrastive mean-difference probe
uses
Probe construction method: concept vector at each layer is L2-normalized difference between mean positive and mean negative representations from contrastive system prompts
Logit-based self-report
uses
Primary self-report measure: probability-weighted expected value over all ten digit-token logits, yielding a continuous rating that preserves full distributional signal
Benjamini-Hochberg (BH) correction
implements
Applied within concept/endpoint families to control false discovery rate across parallel tests
Cluster bootstrap confidence intervals
implements
Bootstrap resampling at conversation level (B=1000, 95% percentile CIs) to respect non-independence of within-conversation observations
Linear mixed-effects models (LMMs)
implements
Primary statistical model with random intercept by conversation, REML estimation, for pooled conversation-turn observations
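The two resampling and multiplicity steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's or the concept-probe library's code: the function names `bh_adjust` and `cluster_bootstrap_ci`, the dict-of-lists layout for conversations, and the simple percentile index rule are all hypothetical choices. The mixed-effects model is not sketched here because it needs a dedicated statistics package.

```python
import random


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values for ONE family of tests.

    Call once per concept/endpoint family so the false discovery rate
    is controlled within that family.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def cluster_bootstrap_ci(clusters, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI that resamples whole conversations with replacement.

    clusters:  dict mapping a conversation id to its list of observations
               (hypothetical layout, chosen for this sketch).
    statistic: function taking a flat list of observations.
    """
    rng = random.Random(seed)
    ids = list(clusters)
    stats = []
    for _ in range(n_boot):
        drawn = rng.choices(ids, k=len(ids))  # resample conversations, not turns
        sample = [obs for cid in drawn for obs in clusters[cid]]
        stats.append(statistic(sample))
    stats.sort()
    lo_idx = round(alpha / 2 * n_boot)
    hi_idx = min(n_boot - 1, max(lo_idx, round((1 - alpha / 2) * n_boot) - 1))
    return stats[lo_idx], stats[hi_idx]
```

Resampling at the conversation level keeps all turns of a conversation together, which is what lets the interval reflect within-conversation dependence.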

Concepts (1)

concept

Causal informational coupling
implements
Operational definition of introspection: self-report covaries monotonically with probe-defined direction AND causally shifting activations shifts the report in a semantically coherent way
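The quantities this definition relates can be sketched as follows. This is an illustrative sketch only: all function names are hypothetical, the rating scale is assumed to map the ten digit tokens to the values 0 through 9, the softmax is assumed to be renormalized over those ten tokens alone, and steering is simplified to adding a scaled copy of the probe direction, which may differ from how the paper constructs its steering vector.

```python
import math


def concept_vector(pos, neg):
    """L2-normalized difference between mean positive and mean negative
    activations at one layer (each argument is a list of vectors)."""
    dim = len(pos[0])
    mean_pos = [sum(row[i] for row in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(row[i] for row in neg) / len(neg) for i in range(dim)]
    diff = [p - n for p, n in zip(mean_pos, mean_neg)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def probe_score(activation, direction):
    """Projection of one activation vector onto the concept direction."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation, direction, strength):
    """Assumed steering rule: shift the activation along the direction."""
    return [a + strength * d for a, d in zip(activation, direction)]


def expected_rating(digit_logits):
    """Probability-weighted expected rating over ten digit-token logits.

    Assumes digit_logits[d] is the logit of the token for digit d (0..9).
    """
    if len(digit_logits) != 10:
        raise ValueError("expected exactly ten digit logits")
    top = max(digit_logits)
    exps = [math.exp(x - top) for x in digit_logits]  # numerically stable softmax
    total = sum(exps)
    return sum(d * e / total for d, e in enumerate(exps))
```

Under this reading, the first half of the definition asks whether `expected_rating` rises and falls with `probe_score` across turns, and the second half asks whether applying `steer` with increasing strength moves `expected_rating` in the matching direction.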

Artifacts (1)

artifact

concept-probe Python library (github.com/mneuronico/concept-probe)
implements
Open-source Python library, released with the paper, that supports probe training, multi-probe scoring, activation steering, and logit extraction

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Introspection · concept · 0.799
The ability of a model to observe its own past internal states or computations; claimed to be architecturally permitted by transformers.
Functional Introspection · concept · 0.790
Tracking of functional/computational cognitive states, distinguished from phenomenal introspection.
Systematic Introspective Processes · concept · 0.787
Identified gap; methods for enabling machine consciousness development through self-examination.
Phenomenal Introspection · concept · 0.781
Direct introspection into phenomenal consciousness; its correlation with functional introspection is an open question.
AI Introspection · concept · 0.777
Key gap identified in the literature; systematic self-examination processes for machine consciousness development.
Emergent Introspective Awareness Framework (Lindsey 2026) · framework · 0.776
Prior framework claiming frontier LLMs can detect and name injected concepts, interpreted as nascent self-awareness
model introspection · concept · 0.776
The capacity of a model to self-report on its internal emotional state when its SAE features are steered, used here as a measurement tool
Introspective strength · concept · 0.774
Spearman ρ measuring rank-order agreement between logit-based self-report and probe score; the paper's primary monotonic association metric
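A rank-order agreement of this kind can be computed as the Pearson correlation of the two rank vectors. The sketch below uses only the standard library and assigns average ranks to ties; the function names and the example numbers are illustrative and are not taken from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(probe_scores, self_reports):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(probe_scores), average_ranks(self_reports))


# Toy values for illustration only (not data from the paper).
rho = spearman_rho([0.1, 0.4, 0.9, 1.5], [2.0, 3.1, 3.0, 6.2])
```

Because only ranks enter the computation, any strictly increasing relationship between probe score and self-report yields a value of 1, regardless of its shape.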