method
active
method:variance-matched-random-probe-comparison
Variance-Matched Random Probe Comparison
Controls for variance by sampling random directions from top-k principal-component (PC) subspaces chosen to match each emotion probe's explained variance, then subtracting the median persistence of 20 matched directions from the probe's persistence.
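The control described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the original implementation: the source does not say how k is matched or how persistence is scored, so `matched_k` (which picks the k whose top-k eigenvalue mean is closest to the probe's variance) and `lag1_persistence` (a lag-1 autocorrelation of the projection) are hypothetical stand-ins, as are all function names.

```python
import numpy as np

def pca_basis(acts):
    """Return PCA eigenvalues (descending) and principal directions (rows)."""
    centered = acts - acts.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    eigvals = s**2 / (acts.shape[0] - 1)
    return eigvals, vt

def direction_variance(acts, direction):
    """Sample variance of activations projected onto a unit-normalized direction."""
    u = direction / np.linalg.norm(direction)
    return np.var(acts @ u, ddof=1)

def matched_k(eigvals, target_var):
    """ASSUMPTION: choose k so a random unit vector in the top-k PC subspace
    has expected variance (mean of the top-k eigenvalues) closest to target_var."""
    running_mean = np.cumsum(eigvals) / np.arange(1, len(eigvals) + 1)
    return int(np.argmin(np.abs(running_mean - target_var))) + 1

def sample_matched_directions(components, k, n, rng):
    """Draw n unit vectors uniformly from the span of the top-k PCs."""
    coeffs = rng.standard_normal((n, k))
    dirs = coeffs @ components[:k]
    return dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

def lag1_persistence(acts, direction):
    """HYPOTHETICAL persistence score: lag-1 autocorrelation of the projection,
    treating rows of acts as consecutive token positions."""
    proj = acts @ direction
    proj = proj - proj.mean()
    return float(proj[:-1] @ proj[1:] / (proj @ proj))

def variance_controlled_persistence(acts, probe, persistence_fn=lag1_persistence,
                                    n_random=20, seed=0):
    """Probe persistence minus the median persistence of matched random directions."""
    rng = np.random.default_rng(seed)
    eigvals, comps = pca_basis(acts)
    probe = probe / np.linalg.norm(probe)
    k = matched_k(eigvals, direction_variance(acts, probe))
    randoms = sample_matched_directions(comps, k, n_random, rng)
    baseline = np.median([persistence_fn(acts, r) for r in randoms])
    return persistence_fn(acts, probe) - baseline, k
```

Using the median rather than the mean of the 20 matched directions keeps the baseline robust to an occasional outlier direction; a positive result would suggest persistence beyond what variance alone accounts for, under whatever persistence score is plugged in.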
Neighborhood — ranked by edge-count
Findings (2)
finding
- Demonstrates that Cogito emotion probes are persistently active beyond what is explained by their variance alone
- Shows self-evaluated emotionality is negatively confounded by variance, requiring variance control to reveal the true signal
Concepts (1)
concept
- Baseline persistence of any probe direction arising from the autoregressive nature of LLMs, not specific to emotion content
Methods (1)
method
- Variance-matched random probe control (related_to): Control method that samples random directions from top-k PC spaces matched to emotion probe variance, to isolate emotion-specific persistence
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Probe construction method: concept vector at each layer is L2-normalized difference between mean positive and mean negative representations from contrastive system prompts
- Baseline method sampling a random vector as feature direction for comparison with learned methods
- Core empirical claim distinguishing emotion persistence from generic high-variance probe persistence
- Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication
- Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
- Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
- Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff
- Shows that truth representations are not reducible to text probability representations
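The probe construction named in the first item of this list (an L2-normalized difference between mean positive and mean negative representations, computed per layer) can be illustrated with a short sketch. The function names and the dictionary-of-layers layout are assumptions made for illustration; the source does not specify how activations are stored or collected.

```python
import numpy as np

def concept_vector(pos, neg):
    """L2-normalized difference between mean positive and mean negative
    representations; pos and neg are (n_examples, d_model) arrays."""
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    return diff / np.linalg.norm(diff)

def concept_vectors_by_layer(pos_by_layer, neg_by_layer):
    """ASSUMED layout: dicts mapping layer index to an activation array."""
    return {layer: concept_vector(pos_by_layer[layer], neg_by_layer[layer])
            for layer in pos_by_layer}
```

For example, `concept_vector(np.array([[2.0, 0.0], [4.0, 0.0]]), np.zeros((2, 2)))` yields the unit vector along the first axis.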