concept
active
concept:contrastive-system-prompt-completionsContrastive system prompt completions
Training method for probes: generate completions under opposing system prompts to induce positive and negative poles of a concept
Neighborhood — ranked by edge-count
Methods (1)
method
- Contrastive mean-difference probeimplementsProbe construction method: concept vector at each layer is L2-normalized difference between mean positive and mean negative representations from contrastive system prompts
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Supervised learning framework where system learns by observing contrast between current response and nudged improved response; requires weak additional forces from supervisor
- A sense of being complete and comfortable, as in the friendly house edge, that enhances life.
- LAT methodology step constructing paired prompts that elicit divergent behaviors to extract steering vectors
- The property that living structures contain intense contrast—far more than one imagines helpful; true opposites which annihilate each other when superimposed, creating differentiation that gives birth to something; contrast unifies rather than separates when used correctly
- Unsupervised probing method from Burns et al. 2023 that identifies directions along which contrast pair representations are far apart
- Unsupervised probe by Burns et al. to predict latent truth representations; cited as related but limited in generalization
- Four best contrastive prompt pairs outperform full 16-pair average steering vector for type hint suppressionfinding0.705Optimization result for steering vector construction.
- Method for obtaining concept vectors by subtracting activations from two contrasting prompts.