Variance-matched random probe control

Control method that samples random directions from top-k PC spaces, matched to each emotion probe's explained variance, to isolate emotion-specific persistence

Neighborhood — ranked by edge-count

concept

residual persistence
implements
Emotion feature persistence beyond what high explained variance alone would predict, computed by subtracting the median persistence of variance-matched random probes

method

Variance-Matched Random Probe Comparison
related_to
Controls for variance by sampling random directions from top-k PC spaces that match each emotion probe's explained variance, then subtracting the median persistence of 20 matched directions
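
The control described in the two entries above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the source's implementation: the persistence metric (lag-1 autocorrelation of the projected activations), the rule for choosing k (pick the k whose top-k PC variances average closest to the probe's variance), and all function names are hypothetical. Only the 20 matched directions and the median subtraction come from the descriptions above.

```python
import numpy as np

def explained_variance(X, direction):
    """Fraction of total activation variance captured along a direction."""
    Xc = X - X.mean(axis=0)
    u = direction / np.linalg.norm(direction)
    return (Xc @ u).var() / Xc.var(axis=0).sum()

def lag1_persistence(X, direction):
    """ASSUMED persistence metric: lag-1 autocorrelation of the projection
    of token-ordered activations X (n_tokens x d) onto the direction."""
    z = X @ direction
    z = z - z.mean()
    return float(z[:-1] @ z[1:] / (z @ z))

def variance_matched_directions(X, probe, n_samples=20, seed=0):
    """Sample random unit directions from a top-k PC subspace, with k chosen
    so their expected variance is close to the probe's (ASSUMED rule)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    # PCA via SVD: rows of Vt are principal components,
    # s**2 / n are the per-component variances.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    pc_var = s**2 / X.shape[0]
    target = explained_variance(X, probe) * pc_var.sum()
    # A unit direction with isotropic Gaussian coefficients in the top-k span
    # has expected variance equal to the mean of the top-k PC variances.
    running_mean = np.cumsum(pc_var) / np.arange(1, len(pc_var) + 1)
    k = int(np.argmin(np.abs(running_mean - target))) + 1
    dirs = rng.standard_normal((n_samples, k)) @ Vt[:k]
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return dirs, k

def residual_persistence(X, probe, n_samples=20, seed=0):
    """Probe persistence minus the median persistence of matched controls."""
    probe = probe / np.linalg.norm(probe)
    dirs, _ = variance_matched_directions(X, probe, n_samples, seed)
    control = np.median([lag1_persistence(X, d) for d in dirs])
    return lag1_persistence(X, probe) - control
```

Under these assumptions, calling `residual_persistence(X, probe)` with `X` as a token-by-dimension activation matrix and `probe` as an emotion probe direction returns a value above zero when the probe persists longer than directions of comparable variance. Matching in expectation is a design simplification; a stricter variant could rejection-sample directions until each one's explained variance falls within a tolerance of the probe's.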

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Random Latent Ablation Control · method · 0.730
Control experiment ablating random latents matched for activation frequency and magnitude to test OTD specificity
Random direction controls show weak non-significant coupling (ρ=-0.11 to 0.17; R²=0.03–0.11) compared to true probes (∆ρ=0.23–0.79, all p<0.05) · finding · 0.717
Controls for probe artifacts; demonstrates self-reports carry information specifically about probe-defined concept directions
Logistic Regression Probe · method · 0.702
Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication
Unsupervised Probing · method · 0.702
Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff
Probe-Based Data Attribution · method · 0.699
Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
Probe Generalization · concept · 0.698
The ability of probes trained on one dataset to transfer accurately to topically and structurally different datasets
Emotion probes are more persistent than variance-matched random probes, indicating emotion-specific persistence beyond autoregressive dynamics · claim · 0.693
Core empirical claim distinguishing emotion persistence from generic high-variance probe persistence
Probes · concept · 0.688
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.