Previous-turn probe score

Internal state measured at the turn before the self-report question is appended, so that the measurement reflects the model's spontaneous internal state, uncontaminated by the rating question.

Neighborhood — ranked by edge-count

Methods (1)

method

Contrastive mean-difference probe
uses
Probe construction method: the concept vector at each layer is the L2-normalized difference between the mean positive and mean negative representations obtained from contrastive system prompts.
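The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the source's implementation: the function name `concept_vectors`, the array layout `(n_examples, n_layers, d_model)`, and the `eps` guard against zero norms are all assumptions introduced here, and the random arrays merely stand in for activations collected under positive and negative system prompts.

```python
import numpy as np

def concept_vectors(pos_acts, neg_acts, eps=1e-12):
    """Per-layer concept vectors from contrastive activations.

    pos_acts, neg_acts: arrays of shape (n_examples, n_layers, d_model)
    (hypothetical layout). Returns an array of shape (n_layers, d_model)
    whose rows have unit L2 norm.
    """
    pos = np.asarray(pos_acts, dtype=float)
    neg = np.asarray(neg_acts, dtype=float)
    # Difference between the mean positive and mean negative representation.
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    # L2-normalize each layer's vector; eps avoids division by zero.
    norms = np.linalg.norm(diff, axis=-1, keepdims=True)
    return diff / np.maximum(norms, eps)

# Stand-in data: 8 positive and 6 negative examples, 4 layers, width 16.
rng = np.random.default_rng(0)
pos = rng.normal(loc=0.5, size=(8, 4, 16))
neg = rng.normal(loc=-0.5, size=(6, 4, 16))
vecs = concept_vectors(pos, neg)
```

The positive and negative sets do not need the same number of examples, since each is averaged before the difference is taken.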

Concepts (1)

concept

Probe score
extends
Dot product between the hidden state and the concept vector, averaged across a 5-layer window around the best layer; measures the model's internal emotive state.
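As a rough sketch of the score just described, the snippet below averages per-layer dot products over a window around the best layer. Several details are assumptions, not taken from the source: the window is centered on `best_layer` and clipped at the edges of the layer stack, the inputs are laid out as `(n_layers, d_model)`, and `probe_score` is a hypothetical name. For the previous-turn variant, `hidden` would be read at the turn before the self-report question is appended; which token position is used is not specified here.

```python
import numpy as np

def probe_score(hidden, vectors, best_layer, window=5):
    """Mean dot product between hidden states and concept vectors.

    hidden, vectors: arrays of shape (n_layers, d_model) (hypothetical layout).
    The window is centered on best_layer and clipped to valid layers
    (an assumption; the source only says "around best layer").
    """
    hidden = np.asarray(hidden, dtype=float)
    vectors = np.asarray(vectors, dtype=float)
    half = window // 2
    lo = max(0, best_layer - half)
    hi = min(hidden.shape[0], best_layer + half + 1)
    # Row-wise dot product for each layer inside the window.
    per_layer = np.einsum("ld,ld->l", hidden[lo:hi], vectors[lo:hi])
    return float(per_layer.mean())

# Toy check: unit concept vectors along axis 0, hidden state at layer l is l * e0,
# so the per-layer dot product equals the layer index.
vectors = np.zeros((8, 4))
vectors[:, 0] = 1.0
hidden = np.arange(8, dtype=float)[:, None] * vectors
score_mid = probe_score(hidden, vectors, best_layer=3)   # layers 1..5
score_edge = probe_score(hidden, vectors, best_layer=0)  # layers 0..2 after clipping
```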

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The steering-sign test functions as a practical probe-validation criterion: self-report changes that run opposite to the steering direction make probe quality suspect
claim · 0.723
Methodological contribution: used to exclude focus-1B and impulsivity-8B from scaling analysis
If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable.
quote · 0.715
Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
Probes
concept · 0.707
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
Steerability Score (Phi)
concept · 0.705
Aggregate metric averaging mean SJT scores across OCEAN traits and steering directions; maximum possible is 10
Pragmatic Turn in Cognitive Science
concept · 0.691
Philosophical move from rule-governed symbolic representation to action-oriented situational cognition
Pass Rate Scoring
method · 0.688
Primary metric for all benchmarks, measuring fraction of tasks that meet benchmark-specific pass criteria
When probe and self-report agree and move together causally, confidence in both increases as evidence they track the same underlying state
claim · 0.687
Convergent validity logic applied to LLM interpretability; probes validate self-reports and vice versa
Logistic Regression Probe
method · 0.687
Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication
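
The steering-sign criterion that appears among the related entries above can be illustrated with a small check. The entries do not say how the direction of the self-report shift is assessed, so this sketch assumes one simple option: fit a line to self-report ratings as a function of steering strength and require a positive slope. The function name `steering_sign_ok` and the example numbers are hypothetical.

```python
import numpy as np

def steering_sign_ok(strengths, reports):
    """Return True if self-report rises with steering strength.

    strengths: steering coefficients applied along the concept direction.
    reports: self-report ratings observed at each strength.
    Uses the sign of a least-squares slope (an assumed criterion).
    """
    strengths = np.asarray(strengths, dtype=float)
    reports = np.asarray(reports, dtype=float)
    # polyfit returns coefficients from highest degree down; index 0 is the slope.
    slope = np.polyfit(strengths, reports, deg=1)[0]
    return bool(slope > 0)

strengths = [-2, -1, 0, 1, 2]
expected_direction = steering_sign_ok(strengths, [1, 2, 3, 4, 5])
inverted_direction = steering_sign_ok(strengths, [5, 4, 3, 2, 1])
```

Under this assumed criterion, a probe whose steering produces the inverted pattern would be flagged as suspect even if its separation metrics looked acceptable.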