method
active
method:logit-based-self-report

Logit-based self-report

Primary self-report measure: the probability-weighted expected value over all ten digit-token logits, which yields a continuous rating that preserves the full distributional signal.
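A minimal sketch of this measure, assuming the ten logits for the digit tokens 0-9 have already been read off at the first generation step and are renormalized among themselves with a softmax. The function name `expected_rating` and the choice to renormalize over the digit tokens only are illustrative assumptions, not details confirmed by the source.

```python
import math

def expected_rating(digit_logits):
    """Probability-weighted expected value over the digit tokens 0-9.

    digit_logits: ten raw logits, index d holding the logit of digit token d.
    Assumption: probabilities are renormalized over these ten tokens only.
    """
    if len(digit_logits) != 10:
        raise ValueError("expected one logit per digit token 0-9")
    # Softmax over the ten digit logits; subtracting the max keeps exp() stable.
    top = max(digit_logits)
    weights = [math.exp(x - top) for x in digit_logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Expected value: each digit weighted by its probability.
    return sum(digit * p for digit, p in enumerate(probs))

if __name__ == "__main__":
    # Hypothetical logits, for illustration only.
    print(expected_rating([0.1, 0.3, 0.2, 1.5, 2.0, 1.2, 0.4, 0.1, 0.0, -0.5]))
```

The result is a real number in the range 0 to 9 rather than a single digit, so small shifts in probability mass between neighboring digits change the rating.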

Neighborhood — ranked by edge-count

Thinkers (1)

thinker
  • Showed finer-grained scalar judgments can be extracted from token distributions; motivated logit-based self-report method

Frameworks (1)

framework
  • The paper's central contribution: treating LLM numeric self-report as a quantitative signal validated against probe-defined internal states with causal confirmation via steering

Concepts (3)

concept
  • Spearman ρ measuring rank-order agreement between logit-based self-report and probe score; the paper's primary monotonic association metric
  • Monitoring approach not requiring internal model access; applicable to proprietary systems and scales naturally with model size
  • Full distribution over tokens 0-9 at first generation step; contains more information than any single sampled token
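The rank-order agreement named in the first concept above can be sketched as follows: Spearman ρ is the Pearson correlation of the two rank vectors, with tied values sharing an average rank. The helper names `average_ranks` and `spearman_rho` are hypothetical, and the inputs are assumed to be paired lists of logit-based ratings and probe scores for the same prompts.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of entries tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman rank correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)
```

Because only ranks enter the calculation, ρ captures monotonic association and is unaffected by any monotone rescaling of the probe score.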

Methods (2)

method
  • Primary tool in human psychometrics for tracking latent internal states; adapted as the core measure in this paper for LLMs
  • Baseline self-report method selecting the highest-probability token; shown to collapse to a few uninformative values
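A small illustration of why the highest-probability baseline can lose information, using two hypothetical logit vectors invented for this sketch (they are not data from the paper). Both share the same top digit, yet their probability mass leans in opposite directions, which only the expected value registers. The function names are illustrative.

```python
import math

def argmax_rating(digit_logits):
    """Baseline: the digit whose token has the highest logit."""
    return max(range(len(digit_logits)), key=lambda d: digit_logits[d])

def expected_rating(digit_logits):
    """Probability-weighted mean digit under a softmax over the digit logits."""
    top = max(digit_logits)
    weights = [math.exp(x - top) for x in digit_logits]
    total = sum(weights)
    return sum(d * w / total for d, w in enumerate(weights))

# Hypothetical logits: both peak at digit 5, with a secondary bump at 3 or 7.
leans_low = [0.0, 0.0, 0.0, 1.5, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0]
leans_high = [0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 1.5, 0.0, 0.0]

if __name__ == "__main__":
    for name, logits in [("leans_low", leans_low), ("leans_high", leans_high)]:
        print(name, argmax_rating(logits), round(expected_rating(logits), 3))
```

The baseline reports the same digit for both inputs, while the expected value separates them.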

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Self-report · concept · 0.827
    The model's verbal description of its internal state, which may be accurate or confabulated.
  • Central methodological contribution: computing probability-weighted expected value over digit-token logits recovers continuous, informative signal
  • Computing each feature's linear effect on output token logits via path expansion through MLP output weights and unembedding matrix
  • Logit Lens · method · 0.747
    Unsupervised interpretability technique that projects activations through unembedding matrix; provides comparison point for NLA approach.
  • Selfing · concept · 0.746
    Process of reifying one's identity as an independent self; meditation practices aim to decrease selfing.
  • Technique of eliciting and interpreting AI self-reports to assess internal states; discussed as promising but challenging.
  • The capacity of Kimi K2.5 to evaluate its own internal emotional state when steered, used as a novel interpretability signal
  • The epistemological core of Alexander's method: the human observer's inner state is a reliable, replicable measuring device for objective properties of the external world