Logit-based self-report

Primary self-report measure: the probability-weighted expected value over the ten digit tokens (0-9), computed from their logits, yielding a continuous rating that preserves the full distributional signal.
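
A minimal sketch of how such a measure could be computed, assuming the ten logits for the tokens "0" through "9" have already been read from the model at the first generation step. The function name `expected_rating` and the choice to renormalize over the digit tokens only are illustrative assumptions, not details confirmed by the source.

```python
import math


def expected_rating(digit_logits):
    """Probability-weighted expected rating from logits for tokens '0'..'9'.

    digit_logits[d] is the logit of the token for digit d.
    Assumption: probabilities are renormalized over the ten digit tokens only.
    """
    if len(digit_logits) != 10:
        raise ValueError("expected exactly ten logits, one per digit 0-9")

    # Numerically stable softmax: subtract the max before exponentiating.
    peak = max(digit_logits)
    weights = [math.exp(x - peak) for x in digit_logits]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Expected value: each digit weighted by its probability.
    return sum(digit * p for digit, p in enumerate(probs))
```

With uniform logits the rating sits at the midpoint of the scale; as probability mass shifts toward higher digits, the rating moves continuously rather than jumping between integers.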

Neighborhood — ranked by edge-count

Papers (1)

paper

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
introduces

Thinkers (1)

thinker

Krystian Zawistowski
studies
Showed finer-grained scalar judgments can be extracted from token distributions; motivated logit-based self-report method

Frameworks (1)

framework

Quantitative Introspection Framework
uses
The paper's central contribution: treating LLM numeric self-report as a quantitative signal validated against probe-defined internal states with causal confirmation via steering

Concepts (3)

concept

Introspective strength
uses
Spearman ρ measuring rank-order agreement between logit-based self-report and probe score; the paper's primary monotonic association metric
Black-box internal state monitoring
implements
Monitoring approach that does not require access to model internals; applicable to proprietary systems and scaling naturally with model size
Digit-token logit distribution
implements
Full distribution over tokens 0-9 at first generation step; contains more information than any single sampled token
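
The introspective strength entry above describes a Spearman ρ between self-report ratings and probe scores. The following is a sketch of that rank correlation in plain Python, assuming two equal-length lists of paired values. The helper names `average_ranks` and `spearman_rho` are hypothetical, and tied values receive their average rank, which is a common convention rather than something the source specifies.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(self_reports, probe_scores):
    """Spearman rho: Pearson correlation computed on the ranks."""
    if len(self_reports) != len(probe_scores) or len(self_reports) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(self_reports), average_ranks(probe_scores)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / math.sqrt(var_x * var_y)
```

Because only ranks enter the calculation, any monotonic relationship between rating and probe score gives ρ = 1, which is why the measure suits a monotonic association metric.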

Methods (2)

method

Numeric self-report
extends
Primary tool in human psychometrics for tracking latent internal states; adapted as the core measure in this paper for LLMs
Greedy-decoded self-report
contradicts
Baseline self-report method selecting the highest-probability token; shown to collapse to a few uninformative values
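
To illustrate why greedy decoding can hide information that the logit-based measure keeps, the sketch below compares the two on a pair of made-up digit-logit vectors. The logit values are invented for illustration and do not come from the paper; the function names are hypothetical, and the softmax is renormalized over the ten digit tokens as an assumption.

```python
import math


def greedy_rating(digit_logits):
    """Greedy self-report: the single digit with the highest logit."""
    return max(range(10), key=lambda d: digit_logits[d])


def expected_rating(digit_logits):
    """Logit-based self-report: probability-weighted mean over digits 0-9."""
    peak = max(digit_logits)
    weights = [math.exp(x - peak) for x in digit_logits]
    total = sum(weights)
    return sum(d * w / total for d, w in enumerate(weights))


# Illustrative logits only: both peak at digit 5, but the remaining
# probability mass leans toward lower digits in one and higher in the other.
leaning_low = [0.0, 0.0, 1.0, 2.0, 2.5, 3.0, 0.0, 0.0, 0.0, 0.0]
leaning_high = [0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 2.5, 2.0, 1.0, 0.0]

for name, logits in [("leaning_low", leaning_low), ("leaning_high", leaning_high)]:
    print(name, greedy_rating(logits), round(expected_rating(logits), 2))
```

Both vectors give the same greedy answer, while the expected value separates them, which is the kind of distinction a greedy baseline cannot express.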

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Self-report · concept · 0.827
The model's verbal description of its internal state, which may be accurate or confabulated.
Logit-based self-report unmasks introspective capacity that greedy decoding conceals · claim · 0.791
Central methodological contribution: computing probability-weighted expected value over digit-token logits recovers continuous, informative signal
Logit Weight Analysis · method · 0.757
Computing each feature's linear effect on output token logits via path expansion through MLP output weights and unembedding matrix
Logit Lens · method · 0.747
Unsupervised interpretability technique that projects activations through unembedding matrix; provides comparison point for NLA approach.
Selfing · concept · 0.746
Process of reifying one's identity as an independent self; meditation practices aim to decrease selfing.
Self-Report Method for AI Introspection · method · 0.736
Technique of eliciting and interpreting AI self-reports to assess internal states; discussed as promising but challenging.
LLM Introspective Self-Report · concept · 0.735
The capacity of Kimi K2.5 to evaluate its own internal emotional state when steered, used as a novel interpretability signal
Self as Measuring Instrument · concept · 0.734
The epistemological core of Alexander's method: the human observer's inner state is a reliable, replicable measuring device for objective properties of the external world