Contrastive mean-difference probe

Probe construction method: the concept vector at each layer is the L2-normalized difference between the mean positive and mean negative representations, both obtained from completions generated under contrastive system prompts.
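The construction above can be sketched in a few lines. This is a minimal illustration and not the original implementation: plain Python lists stand in for the activation vectors of one layer, and the function names `mean_difference_probe` and `probe_score` are hypothetical. How activations are pooled over tokens before averaging is not stated here, so the sketch assumes one vector per completion.

```python
import math

def mean_difference_probe(pos, neg):
    """Unit-norm direction mean(pos) - mean(neg) for a single layer.

    pos, neg: lists of equal-length activation vectors (assumed: one
    vector per completion, already pooled over tokens).
    """
    dim = len(pos[0])
    mean_pos = [sum(v[i] for v in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(v[i] for v in neg) / len(neg) for i in range(dim)]
    diff = [p - n for p, n in zip(mean_pos, mean_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("positive and negative means coincide")
    return [d / norm for d in diff]  # L2-normalized concept vector

def probe_score(direction, activation):
    """Project one activation vector onto the concept direction."""
    return sum(d * a for d, a in zip(direction, activation))

# Toy data: the two classes differ only along the first coordinate.
pos = [[2.0, 1.0], [4.0, 1.0]]
neg = [[0.0, 1.0], [0.0, 1.0]]
direction = mean_difference_probe(pos, neg)
score = probe_score(direction, [3.0, 1.0])
```

With this toy data the direction is the first coordinate axis, so the score of any activation is simply its first component. In practice one such direction would be computed per layer.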

Neighborhood — ranked by edge-count

Thinkers (1)

thinker

Andy Zou
introduces
Lead author of Representation Engineering paper establishing RepE paradigm

Frameworks (1)

framework

Quantitative Introspection Framework
uses
The paper's central contribution: treating LLM numeric self-report as a quantitative signal validated against probe-defined internal states with causal confirmation via steering

Concepts (6)

concept

Focus probe (distracted vs. focused)
implements
One of four emotive concept probes trained; contrastive pair distracted/focused with best layer 10 in LLaMA-3.2-3B
Impulsivity probe (impulsive vs. planning)
implements
One of four emotive concept probes trained; contrastive pair impulsive/planning with best layer 13 in LLaMA-3.2-3B
Interest probe (bored vs. interested)
implements
One of four emotive concept probes trained; contrastive pair bored/interested with best layer 14 in LLaMA-3.2-3B
Wellbeing probe (sad vs. happy)
implements
One of four emotive concept probes trained; contrastive pair sad/happy with best layer 16 in LLaMA-3.2-3B
Previous-turn probe score
uses
Internal state measured at the turn BEFORE the self-report question is appended; ensures measurement of spontaneous internal state uncontaminated by rating question
Contrastive system prompt completions
implements
Training method for probes: generate completions under opposing system prompts to induce positive and negative poles of a concept
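The previous-turn measurement listed above can be illustrated as follows. This is a sketch under stated assumptions: `turn_activations` is a hypothetical list holding one pooled activation vector per conversation turn at the probe's layer, and `report_turn` is the hypothetical index of the turn where the rating question is appended. The point it shows is only the indexing: the score is read one turn earlier than the self-report question.

```python
def previous_turn_score(direction, turn_activations, report_turn):
    """Probe score at the turn before the self-report question.

    direction: unit-norm concept vector for the chosen layer.
    turn_activations: one activation vector per turn (assumed layout).
    report_turn: index of the turn containing the rating question.
    """
    if report_turn < 1:
        raise ValueError("no turn precedes the self-report question")
    previous = turn_activations[report_turn - 1]
    return sum(d * a for d, a in zip(direction, previous))

# Toy conversation with three turns; the rating question is at index 2.
direction = [1.0, 0.0]
turn_activations = [[0.5, 2.0], [1.5, -1.0], [9.0, 9.0]]
prev_score = previous_turn_score(direction, turn_activations, report_turn=2)
```

The activation at index 2 is deliberately ignored, since it would already reflect the rating question.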

Claims (2)

claim

The steering-sign test functions as a practical probe-validation criterion: if steering along a probe moves self-reports in the inverted direction, the probe's quality is suspect
extends
Methodological contribution: used to exclude focus-1B and impulsivity-8B from scaling analysis
Even validated probes may capture distributed representations mixing emotive states with correlated features like persona or style
cites
Caveat on probe interpretation; does not negate the introspection result but affects interpretation of the target variable
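One way to make the steering-sign criterion concrete is to regress the mean self-report on the steering coefficient and check the sign of the slope. This is an illustrative assumption rather than the documented procedure: the exact decision rule is not given here, and the names `steering_slope` and `passes_sign_test` are hypothetical.

```python
def steering_slope(alphas, reports):
    """Least-squares slope of mean self-report against steering strength."""
    n = len(alphas)
    mean_a = sum(alphas) / n
    mean_r = sum(reports) / n
    cov = sum((a - mean_a) * (r - mean_r) for a, r in zip(alphas, reports))
    var = sum((a - mean_a) ** 2 for a in alphas)
    if var == 0.0:
        raise ValueError("steering strengths must vary")
    return cov / var

def passes_sign_test(alphas, reports):
    """True when reports rise as the model is steered toward the positive pole."""
    return steering_slope(alphas, reports) > 0.0

# Toy data: steering strengths and the mean numeric self-report at each.
alphas = [-2.0, 0.0, 2.0]
consistent = [3.0, 5.0, 7.0]  # reports follow the steering direction
inverted = [7.0, 5.0, 3.0]    # reports move against it
ok = passes_sign_test(alphas, consistent)
bad = passes_sign_test(alphas, inverted)
```

Under this rule a probe whose reports move against the steering direction, like the `inverted` series, would be flagged for exclusion.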

Methods (1)

method

Linear Probe
extends
Simple linear classifiers trained on model activations used as the probing technique within the introduced method.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Contrast · concept · 0.788
The property that living structures contain intense contrast—far more than one imagines helpful; true opposites which annihilate each other when superimposed, creating differentiation that gives birth to something; contrast unifies rather than separates when used correctly
Contrastive analysis · method · 0.781
Method comparing brain activity in conscious vs. unconscious conditions.
Variance-Matched Random Probe Comparison · method · 0.769
Controls for variance by sampling random directions from top-k PC spaces matching each emotion probe's explained variance, and subtracting median persistence of 20 matched directions
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs · claim · 0.760
Key methodological claim: mass-mean (MM) probes are both competitive in accuracy and superior in causal influence
Contrast-Consistent Search · method · 0.754
Unsupervised probing method from Burns et al. 2023 that identifies directions along which contrast pair representations are far apart
Probes · concept · 0.754
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
Contrastive Pairs · concept · 0.748
Pairs of prompts at different reflection levels used to compute steering vectors.
Difference-in-Means · method · 0.741
Method for extracting linear directions by subtracting mean activations of contrastive groups; used to define the Assistant Axis