method
active
method:logit-based-self-report
Logit-based self-report
Primary self-report measure: the probability-weighted expected value over the ten digit tokens (0-9), computed from their logits. This yields a continuous rating that preserves the full distributional signal.
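A minimal sketch of the computation described above, in plain Python. It assumes the ten logits are ordered by digit (index 0 is the token "0") and that probabilities are renormalized over the digit tokens only. Both points are assumptions made for illustration, and the function name `expected_rating` is hypothetical.

```python
import math

def expected_rating(digit_logits):
    """Probability-weighted expected value over the digit tokens 0-9.

    Assumption: digit_logits[d] is the logit of the token for digit d,
    and probabilities are renormalized over these ten tokens only.
    """
    if len(digit_logits) != 10:
        raise ValueError("expected one logit per digit token 0-9")
    # Numerically stable softmax restricted to the ten digit tokens.
    peak = max(digit_logits)
    exps = [math.exp(x - peak) for x in digit_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Expected value: each digit weighted by its probability.
    return sum(digit * p for digit, p in enumerate(probs))
```

With equal logits for all ten digits the rating is the midpoint of the scale, and as one digit's logit dominates the rating approaches that digit.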
Neighborhood — ranked by edge-count
Papers (1)
paper
Thinkers (1)
thinker
- Krystian Zawistowski · studies · Showed finer-grained scalar judgments can be extracted from token distributions; motivated logit-based self-report method
Frameworks (1)
framework
- The paper's central contribution: treating LLM numeric self-report as a quantitative signal validated against probe-defined internal states with causal confirmation via steering
Concepts (3)
concept
- Spearman ρ measuring rank-order agreement between logit-based self-report and probe score; the paper's primary monotonic association metric
- Black-box internal state monitoring · implements · Monitoring approach not requiring internal model access; applicable to proprietary systems and scales naturally with model size
- Digit-token logit distribution · implements · Full distribution over tokens 0-9 at first generation step; contains more information than any single sampled token
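The rank-order agreement metric listed among the concepts above can be sketched as follows. This is a generic Spearman ρ (Pearson correlation of average ranks, with ties sharing their mean rank), not the paper's own code; the function names and the pairing of self-report ratings with probe scores as two equal-length lists are assumptions for illustration.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(self_reports, probe_scores):
    """Spearman rho: Pearson correlation computed on the ranks."""
    if len(self_reports) != len(probe_scores) or len(self_reports) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(self_reports), average_ranks(probe_scores)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (sx * sy)
```

Because only ranks enter the calculation, any strictly increasing relationship between the two lists gives ρ = 1, whether or not it is linear.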
Methods (2)
method
- Numeric self-report · extends · Primary tool in human psychometrics for tracking latent internal states; adapted as the core measure in this paper for LLMs
- Greedy-decoded self-report · contradicts · Baseline self-report method selecting highest-probability token; shown to collapse to few uninformative values
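The contrast between the greedy baseline and the expected-value rating can be illustrated with a small sketch. The two logit vectors are made up for illustration and are not data from the paper; the function names are hypothetical, and the softmax is again assumed to be taken over the ten digit tokens only.

```python
import math

def digit_probs(logits):
    """Softmax over the ten digit-token logits (assumed renormalization)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_rating(logits):
    """Greedy decoding: the single highest-logit digit."""
    return max(range(10), key=lambda d: logits[d])

def expected_rating(logits):
    """Probability-weighted expected value over digits 0-9."""
    return sum(d * p for d, p in enumerate(digit_probs(logits)))

# Made-up logits: both put the most mass on "5", but the remaining
# mass sits below 5 in one case and above 5 in the other.
leans_low = [0.0, 0.0, 1.0, 2.0, 2.5, 3.0, 0.0, 0.0, 0.0, 0.0]
leans_high = [0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 2.5, 2.0, 1.0, 0.0]

for name, logits in [("leans_low", leans_low), ("leans_high", leans_high)]:
    print(name, greedy_rating(logits), round(expected_rating(logits), 3))
```

Greedy decoding returns the same digit for both vectors, while the expected value falls below 5 for the first and above 5 for the second, so it separates cases that the greedy rating cannot.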
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The model's verbal description of its internal state, which may be accurate or confabulated.
- Central methodological contribution: computing probability-weighted expected value over digit-token logits recovers continuous, informative signal
- Computing each feature's linear effect on output token logits via path expansion through MLP output weights and unembedding matrix
- Unsupervised interpretability technique that projects activations through unembedding matrix; provides comparison point for NLA approach.
- Process of reifying one's identity as an independent self; meditation practices aim to decrease selfing.
- Technique of eliciting and interpreting AI self-reports to assess internal states; discussed as promising but challenging.
- The capacity of Kimi K2.5 to evaluate its own internal emotional state when steered, used as a novel interpretability signal
- The epistemological core of Alexander's method: the human observer's inner state is a reliable, replicable measuring device for objective properties of the external world