Variance-Matched Random Probe Comparison

Controls for variance by sampling random directions from the top-k principal-component (PC) subspace whose explained variance matches that of each emotion probe, then subtracting the median persistence of the 20 matched directions from the probe's persistence

Neighborhood — ranked by edge-count

Findings (2)

finding

Cogito emotion probes show a residual autocorrelation of +0.077 above variance-matched controls (p=1.5e-27; 157/171 probes positive)
introduces
Demonstrates that Cogito emotion probes are persistently active beyond what is explained by their variance alone
Negative correlation between self-evaluated emotion persistence and SAE feature activation variance explained: rho=-0.184, p=4.6e-09
supports
Shows self-evaluated emotionality is negatively confounded by variance, requiring variance control to reveal the true signal

Concepts (1)

concept

autoregressive persistence
about
Baseline persistence of any probe direction arising from the autoregressive nature of LLMs, not specific to emotion content
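
A minimal sketch of why a baseline is needed, assuming persistence is measured as the lag-1 autocorrelation of activations projected onto a direction (the source does not state the exact estimator). The synthetic AR(1) activations, the coefficient `phi`, and the function name `lag1_autocorr` are illustrative assumptions, not taken from the original analysis. The point is only that an arbitrary random direction already shows high persistence when the underlying sequence is autoregressive.

```python
import numpy as np

def lag1_autocorr(x):
    """Standard sample estimate of the lag-1 autocorrelation of a 1-D series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return float(np.dot(x[:-1], x[1:]) / denom) if denom > 0 else 0.0

rng = np.random.default_rng(0)
T, d, phi = 2000, 32, 0.8          # hypothetical sequence length, width, AR coefficient

# Synthetic "activations": each step depends on the previous one (AR(1)).
noise = rng.standard_normal((T, d))
acts = np.zeros((T, d))
for t in range(1, T):
    acts[t] = phi * acts[t - 1] + noise[t]

# A random unit direction carrying no concept-specific content.
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)

persistence_ar = lag1_autocorr(acts @ direction)    # autoregressive sequence
persistence_iid = lag1_autocorr(noise @ direction)  # independent steps

print(f"random direction, AR(1) activations: {persistence_ar:.3f}")
print(f"random direction, i.i.d. activations: {persistence_iid:.3f}")
```

In this toy setting the random direction is persistent only because the sequence is, which is the baseline that any emotion-specific claim has to exceed.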

Methods (1)

method

Variance-matched random probe control
related_to
Control method sampling random directions from top-k PC spaces matched to emotion probe variance, to isolate emotion-specific persistence
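
The control can be sketched as follows, under stated assumptions. The source does not say how k is chosen or how persistence is computed, so this sketch assumes that k is picked so the mean of the top-k PCA eigenvalues (the expected variance of a random unit direction in that subspace) is closest to the probe's projection variance, and that persistence is lag-1 autocorrelation. The function names, the synthetic data, and the matching rule are hypothetical. Only the 20 control directions and the median subtraction come from the description above.

```python
import numpy as np

def lag1_autocorr(x):
    """Standard sample estimate of the lag-1 autocorrelation of a 1-D series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return float(np.dot(x[:-1], x[1:]) / denom) if denom > 0 else 0.0

def matched_control_residual(acts, probe, n_controls=20, seed=0):
    """Probe persistence minus the median persistence of variance-matched controls."""
    rng = np.random.default_rng(seed)
    centered = acts - acts.mean(axis=0)
    probe = probe / np.linalg.norm(probe)

    # PCA via SVD: rows of vt are principal directions, eigvals their variances.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    eigvals = s**2 / (len(centered) - 1)

    # Assumed matching rule: a random unit direction in the top-k subspace has
    # expected variance equal to the mean of the top-k eigenvalues.
    probe_var = np.var(centered @ probe, ddof=1)
    mean_topk = np.cumsum(eigvals) / np.arange(1, len(eigvals) + 1)
    k = int(np.argmin(np.abs(mean_topk - probe_var))) + 1

    # Sample unit directions uniformly within the top-k PC subspace.
    controls = rng.standard_normal((n_controls, k)) @ vt[:k]
    controls /= np.linalg.norm(controls, axis=1, keepdims=True)

    control_persistence = [lag1_autocorr(centered @ c) for c in controls]
    probe_persistence = lag1_autocorr(centered @ probe)
    return probe_persistence - float(np.median(control_persistence)), k

# Synthetic check: dimension 0 is far more persistent than the rest.
rng = np.random.default_rng(1)
T, d = 3000, 16
phis = np.array([0.95] + [0.3] * (d - 1))
noise = rng.standard_normal((T, d))
acts = np.zeros((T, d))
for t in range(1, T):
    acts[t] = phis * acts[t - 1] + noise[t]
# Rescale so the persistent dimension is not also the highest-variance one.
target_std = np.concatenate([[1.0], np.linspace(1.5, 0.3, d - 1)])
acts = acts / acts.std(axis=0) * target_std

probe = np.zeros(d)
probe[0] = 1.0
residual, k = matched_control_residual(acts, probe)
print(f"matched k = {k}, residual persistence = {residual:.3f}")
```

Taking the median over the matched directions rather than the mean keeps a single unusually persistent control from shifting the baseline.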

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Contrastive mean-difference probe · method · 0.769
Probe construction method: concept vector at each layer is L2-normalized difference between mean positive and mean negative representations from contrastive system prompts
Random vector baseline · method · 0.726
Baseline method sampling a random vector as feature direction for comparison with learned methods
Emotion probes are more persistent than variance-matched random probes, indicating emotion-specific persistence beyond autoregressive dynamics. · claim · 0.723
Core empirical claim distinguishing emotion persistence from generic high-variance probe persistence
Logistic Regression Probe · method · 0.722
Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs · claim · 0.719
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
Probe-Based Data Attribution · method · 0.717
Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
Unsupervised Probing · method · 0.716
Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff
Probes trained on the likely dataset perform worse than chance on datasets with anti-correlations between text probability and truth · finding · 0.711
Shows that truth representations are not reducible to text probability representations