method
active
method:contrastive-mean-difference-probeContrastive mean-difference probe
Probe construction method: concept vector at each layer is L2-normalized difference between mean positive and mean negative representations from contrastive system prompts
Neighborhood — ranked by edge-count
Thinkers (1)
thinker
- Andy ZouintroducesLead author of Representation Engineering paper establishing RepE paradigm
Frameworks (1)
framework
- The paper's central contribution: treating LLM numeric self-report as a quantitative signal validated against probe-defined internal states with causal confirmation via steering
Concepts (6)
concept
- Focus probe (distracted vs. focused)implementsOne of four emotive concept probes trained; contrastive pair distracted/focused with best layer 10 in LLaMA-3.2-3B
- One of four emotive concept probes trained; contrastive pair impulsive/planning with best layer 13 in LLaMA-3.2-3B
- Interest probe (bored vs. interested)implementsOne of four emotive concept probes trained; contrastive pair bored/interested with best layer 14 in LLaMA-3.2-3B
- Wellbeing probe (sad vs. happy)implementsOne of four emotive concept probes trained; contrastive pair sad/happy with best layer 16 in LLaMA-3.2-3B
- Internal state measured at the turn BEFORE the self-report question is appended; ensures measurement of spontaneous internal state uncontaminated by rating question
- Contrastive system prompt completionsimplementsTraining method for probes: generate completions under opposing system prompts to induce positive and negative poles of a concept
Claims (2)
claim
- Methodological contribution: used to exclude focus-1B and impulsivity-8B from scaling analysis
- Caveat on probe interpretation; does not negate the introspection result but affects interpretation of the target variable
Methods (1)
method
- Linear ProbeextendsSimple linear classifiers trained on model activations used as the probing technique within the introduced method.
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- The property that living structures contain intense contrast—far more than one imagines helpful; true opposites which annihilate each other when superimposed, creating differentiation that gives birth to something; contrast unifies rather than separates when used correctly
- Method comparing brain activity in conscious vs. unconscious conditions.
- Controls for variance by sampling random directions from top-k PC spaces matching each emotion probe's explained variance, and subtracting median persistence of 20 matched directions
- Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
- Unsupervised probing method from Burns et al. 2023 that identifies directions along which contrast pair representations are far apart
- Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
- Pairs of prompts at different reflection levels used to compute steering vectors.
- Method for extracting linear directions by subtracting mean activations of contrastive groups; used to define the Assistant Axis