method
active
method:contrastive-mean-difference-probe

Contrastive mean-difference probe

Probe construction method: at each layer, the concept vector is the L2-normalized difference between the mean representation under positive contrastive system prompts and the mean representation under negative ones.
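A minimal sketch of the construction described above, assuming activations have already been extracted and stacked as arrays of shape (n_examples, n_layers, d_model). The function names (mean_difference_probe, project), the array layout, and the synthetic data are illustrative assumptions, not taken from the source.

```python
import numpy as np

def mean_difference_probe(pos, neg, eps=1e-12):
    """Return one unit-norm concept vector per layer.

    pos, neg: activations of shape (n_examples, n_layers, d_model),
    collected under positive and negative system prompts (assumed layout).
    """
    pos = np.asarray(pos, dtype=np.float64)
    neg = np.asarray(neg, dtype=np.float64)
    # Difference of class means, computed independently at each layer.
    diff = pos.mean(axis=0) - neg.mean(axis=0)            # (n_layers, d_model)
    # L2-normalize each layer's vector; eps guards against a zero difference.
    norms = np.linalg.norm(diff, axis=-1, keepdims=True)
    return diff / np.maximum(norms, eps)

def project(acts, probe):
    """Scalar probe score per example and layer: dot product with the probe."""
    return np.einsum("nld,ld->nl", acts, probe)

# Synthetic stand-in for real activations (hypothetical data).
rng = np.random.default_rng(0)
n, n_layers, d_model = 32, 4, 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
pos = rng.normal(size=(n, n_layers, d_model)) + 2.0 * true_dir
neg = rng.normal(size=(n, n_layers, d_model)) - 2.0 * true_dir

probe = mean_difference_probe(pos, neg)
scores_pos = project(pos, probe)
scores_neg = project(neg, probe)
```

Because each layer's vector has unit norm, the projected scores are comparable in scale across concepts, though not necessarily across layers whose activation norms differ.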

Neighborhood — ranked by edge-count

Thinkers (1)

thinker
  • Andy Zou
    introduces
    Lead author of the Representation Engineering paper, which established the RepE paradigm

Frameworks (1)

framework
  • The paper's central contribution: treating LLM numeric self-report as a quantitative signal validated against probe-defined internal states with causal confirmation via steering

Concepts (6)

concept

Claims (2)

claim

Methods (1)

method
  • Simple linear classifiers trained on model activations used as the probing technique within the introduced method.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Contrast · concept · 0.788
    The property that living structures contain intense contrast—far more than one imagines helpful; true opposites which annihilate each other when superimposed, creating differentiation that gives birth to something; contrast unifies rather than separates when used correctly
  • Method comparing brain activity in conscious vs. unconscious conditions.
  • Controls for variance by sampling random directions from top-k PC spaces matching each emotion probe's explained variance, and subtracting median persistence of 20 matched directions
  • Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
  • Unsupervised probing method from Burns et al. 2023 that identifies directions along which contrast pair representations are far apart
  • Probes · concept · 0.754
    Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
  • Contrastive Pairs · concept · 0.748
    Pairs of prompts at different reflection levels used to compute steering vectors.
  • Method for extracting linear directions by subtracting mean activations of contrastive groups; used to define the Assistant Axis