Impulsivity probe (impulsive vs. planning)

One of four emotive concept probes trained; contrastive pair impulsive/planning with best layer 13 in LLaMA-3.2-3B

Neighborhood — ranked by edge-count

method

Contrastive mean-difference probe
implements
Probe construction method: the concept vector at each layer is the L2-normalized difference between the mean positive and mean negative representations obtained from contrastive system prompts
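
The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the array layout (examples, layers, hidden size), the function names `mean_difference_probe` and `probe_scores`, and the synthetic activations are all assumptions made for the example.

```python
import numpy as np

def mean_difference_probe(pos, neg):
    """Per-layer concept vectors from contrastive activations.

    pos, neg: arrays of shape (n_examples, n_layers, d_model), assumed to hold
    representations collected under the positive and negative system prompts.
    Returns an array of shape (n_layers, d_model) with unit-norm rows.
    """
    diff = pos.mean(axis=0) - neg.mean(axis=0)            # (n_layers, d_model)
    norms = np.linalg.norm(diff, axis=-1, keepdims=True)  # (n_layers, 1)
    return diff / np.clip(norms, 1e-12, None)             # L2-normalize each layer

def probe_scores(acts, probe):
    """Project activations (n, n_layers, d_model) onto each layer's vector."""
    return np.einsum("nld,ld->nl", acts, probe)           # (n, n_layers)

# Synthetic stand-in data (hypothetical): the two classes differ along axis 0.
rng = np.random.default_rng(0)
n, n_layers, d_model = 32, 4, 16
direction = np.zeros(d_model)
direction[0] = 1.0
pos = rng.normal(size=(n, n_layers, d_model)) + 2.0 * direction
neg = rng.normal(size=(n, n_layers, d_model)) - 2.0 * direction

probe = mean_difference_probe(pos, neg)
pos_scores = probe_scores(pos, probe)
neg_scores = probe_scores(neg, probe)
```

By construction, the mean positive score exceeds the mean negative score at every layer by exactly the norm of the unnormalized difference vector, so the sign of the projection is what separates the two prompt conditions.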

concept

Emotive states in LLMs
implements
Directions in activation space associated with contrastive emotive concept pairs studied in this paper as targets for introspection

finding

Impulsivity probe: peak Cohen's d=3.60 (layer 13), p=3.58×10⁻¹³ in LLaMA-3.2-3B
supports
Strongest probe validation result; highest Cohen's d among the four concepts
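
For readers unfamiliar with the effect-size measure, the sketch below computes Cohen's d per layer and picks the layer with the largest value. It assumes the pooled-standard-deviation form of d and assumes that "best layer" means the layer with peak d; the source states neither explicitly, and the helper names and toy scores are hypothetical.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def best_layer(pos_scores, neg_scores):
    """pos_scores, neg_scores: (n_examples, n_layers) probe projections.

    Returns (index of the layer with the largest d, array of d per layer).
    """
    n_layers = pos_scores.shape[1]
    ds = np.array([cohens_d(pos_scores[:, l], neg_scores[:, l]) for l in range(n_layers)])
    return int(ds.argmax()), ds

# Toy projections for two layers (made-up numbers, not results from the paper).
toy_pos = np.array([[1.0, 5.0], [2.0, 6.0], [3.0, 7.0]])
toy_neg = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
layer, ds = best_layer(toy_pos, toy_neg)
```

In the toy data both layers have unit variance per class, so d equals the raw mean difference: 1 at layer 0 and 5 at layer 1.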

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Focus probe (distracted vs. focused) · concept · 0.757
One of four emotive concept probes trained; contrastive pair distracted/focused with best layer 10 in LLaMA-3.2-3B
Probes · concept · 0.752
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
A probe may achieve high performance even on representations that are not causally relevant for the task · claim · 0.746
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
Interest probe (bored vs. interested) · concept · 0.736
One of four emotive concept probes trained; contrastive pair bored/interested with best layer 14 in LLaMA-3.2-3B
Impulsivity→interest steering: probe entropy increases (LMM slope=0.024, p=2.30×10⁻⁴) but report entropy does not (p=0.11) · finding · 0.735
Evidence of a bottleneck between richer internal variation and final report distribution in impulsivity→interest condition
A question anywhere along the line that elicits a premature attempt at an answer could neutralize the remainder of the process into rationalization. · hypothesis · 0.733
About chain-of-thought and process safety.
Unsupervised Probing · method · 0.729
Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff
Impulsivity→interest: ρ increases from 0.70 (α=-4) to 0.83 (α=+4); R² from 0.46 to 0.69 in LLaMA-3.2-3B · finding · 0.727
Scatter plot visualization showing strengthened probe-report relationship across alpha range
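
The untyped-neighbor list above amounts to a filter: keep entities whose cosine similarity to this one is at least 0.65 and that have no typed edge to it. A minimal sketch of that filter follows. How the entity embeddings are produced is not stated on this page, so the vectors, entity keys, and the function name `untyped_neighbors` are hypothetical.

```python
import numpy as np

def untyped_neighbors(query, embeddings, typed, threshold=0.65):
    """Entities with cosine >= threshold to `query` and no typed edge to it.

    embeddings: dict mapping entity name -> 1-D vector (hypothetical embeddings).
    typed: set of entity names already linked to `query` by a typed edge.
    Returns (name, cosine) pairs sorted by descending similarity.
    """
    q = np.asarray(embeddings[query], dtype=float)
    q = q / np.linalg.norm(q)
    hits = []
    for name, vec in embeddings.items():
        if name == query or name in typed:
            continue  # skip the entity itself and already-typed neighbors
        v = np.asarray(vec, dtype=float)
        sim = float(np.dot(q, v) / np.linalg.norm(v))
        if sim >= threshold:
            hits.append((name, sim))
    return sorted(hits, key=lambda pair: -pair[1])

# Toy 2-D embeddings (made up for illustration only).
toy = {
    "impulsivity_probe": [1.0, 0.0],
    "focus_probe": [0.8, 0.6],       # cosine 0.8, no typed edge: kept
    "mean_diff_method": [0.9, 0.1],  # similar, but already has a typed edge
    "unrelated": [0.0, 1.0],         # cosine 0.0: below threshold
}
candidates = untyped_neighbors("impulsivity_probe", toy, typed={"mean_diff_method"})
```

Excluding typed neighbors before thresholding is what makes the result a list of candidate new edges or possible duplicates rather than a plain nearest-neighbor list.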