concept
active
concept:previous-turn-probe-score
Previous-turn probe score
Internal state measured at the turn before the self-report question is appended; this ensures the measurement reflects the model's spontaneous internal state, uncontaminated by the rating question
Neighborhood — ranked by edge-count
Methods (1)
method
- Probe construction method: the concept vector at each layer is the L2-normalized difference between the mean positive and mean negative representations obtained from contrastive system prompts
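The construction described above can be sketched for a single layer as follows. This is a minimal illustration, not the original implementation: the function name `concept_vector`, the use of plain Python lists in place of activation tensors, and the toy values are all assumptions made for the example.

```python
import math

def concept_vector(pos_states, neg_states):
    """L2-normalized difference between the mean positive and mean negative
    hidden states for one layer (hypothetical helper, not from the source)."""
    dim = len(pos_states[0])
    mean_pos = [sum(v[i] for v in pos_states) / len(pos_states) for i in range(dim)]
    mean_neg = [sum(v[i] for v in neg_states) / len(neg_states) for i in range(dim)]
    diff = [p - n for p, n in zip(mean_pos, mean_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("positive and negative means coincide; no direction to normalize")
    return [d / norm for d in diff]

# Toy 2-dimensional hidden states (illustrative values only).
pos = [[3.0, 1.0], [5.0, 1.0]]   # mean is [4, 1]
neg = [[1.0, 1.0], [1.0, 1.0]]   # mean is [1, 1]
vec = concept_vector(pos, neg)   # difference [3, 0], normalized to unit length
```

In practice the same computation would be repeated per layer, giving one unit-length concept vector for each layer of the model.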
Concepts (1)
concept
- Probe score (extends): dot product between the hidden state and the concept vector, averaged across a 5-layer window around the best layer; measures the model's internal emotive state
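The probe score described above might be computed roughly as below. This is a sketch under stated assumptions: the name `probe_score`, the centering of the window on the best layer, and the clipping of the window at the first and last layers are illustrative choices that the source does not specify. For the previous-turn variant, the hidden states passed in would be those taken at the turn before the self-report question is appended.

```python
def probe_score(hidden_by_layer, concept_by_layer, best_layer, window=5):
    """Average the per-layer dot products over a window around best_layer.

    Assumption: the window is centered on best_layer and clipped at the
    model's first and last layers (hypothetical helper, not from the source).
    """
    half = window // 2
    lo = max(0, best_layer - half)
    hi = min(len(hidden_by_layer) - 1, best_layer + half)
    dots = [
        sum(h * c for h, c in zip(hidden_by_layer[layer], concept_by_layer[layer]))
        for layer in range(lo, hi + 1)
    ]
    return sum(dots) / len(dots)

# Toy setup: 7 layers, 2-dimensional states, the same unit concept vector everywhere.
hidden = [[float(layer), 0.0] for layer in range(7)]
concepts = [[1.0, 0.0] for _ in range(7)]
score = probe_score(hidden, concepts, best_layer=3)        # averages layers 1 to 5
edge_score = probe_score(hidden, concepts, best_layer=0)   # window clipped to layers 0 to 2
```

Because each concept vector has unit length, every dot product is the projection of the hidden state onto the concept direction, so scores from different layers are on a comparable footing before averaging.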
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Methodological contribution: used to exclude focus-1B and impulsivity-8B from scaling analysis
- Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
- Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
- Aggregate metric averaging mean SJT scores across OCEAN traits and steering directions; maximum possible is 10
- Philosophical move from rule-governed symbolic representation to action-oriented situational cognition
- Primary metric for all benchmarks, measuring fraction of tasks that meet benchmark-specific pass criteria
- Convergent validity logic applied to LLM interpretability; probes validate self-reports and vice versa
- Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication