method
active
method:linear-probe
Linear Probe
Simple linear classifiers trained on model activations, used as the probing technique within the introduced method.
Neighborhood — ranked by edge-count
Papers (2)
paper
Thinkers (1)
thinker
- Guillaume Alain · introduces · Introduced linear probes as thermometers for neural representations; foundational work cited for probe methodology
Concepts (5)
concept
- Residual Stream · associated_with · Proposed pathway flowing through layers at each position; calculates K/V values that feed horizontal information flow.
- Truth Direction · implements · A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements
- Emotive states in LLMs · studies · Directions in activation space associated with contrastive emotive concept pairs studied in this paper as targets for introspection
- Truth direction in LLMs · implements · Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023
- Probes · implements · Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
Methods (7)
method
- Linear Probing · related_to · Used to evaluate representation quality across VTAB tasks
- Linear Probe Training · related_to · Method for fitting a linear classifier on collected activations to predict task-relevant features
- Linear Probe for Evaluation Awareness · related_to · Nguyen et al. trained linear probes on activations to distinguish evaluation from deployment scenarios.
- Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
- Probe construction method: concept vector at each layer is L2-normalized difference between mean positive and mean negative representations from contrastive system prompts
- Kim et al. 2018 method for identifying concept directions in CNN activations; precursor to LLM probing
- Between-to-within-class variance ratio · associated_with · Prior-work method for selecting the optimal layer for truth probing by maximizing class separability.
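
Two of the entries above describe per-layer computations: a concept vector built as the L2-normalized difference between mean positive and mean negative representations, and a between-to-within-class variance ratio used to pick a layer. The sketch below illustrates both on synthetic data. The activations, the `separation_by_layer` values, and the function names are hypothetical, and the variance ratio shown (summed squared deviations, between classes over within classes) is one common formulation assumed here, not necessarily the one used in the cited prior work.

```python
import numpy as np


def concept_vector(pos, neg):
    """L2-normalized difference between mean positive and mean negative activations."""
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    return diff / np.linalg.norm(diff)


def variance_ratio(pos, neg):
    """Between-class over within-class sum of squares (assumed formulation)."""
    grand_mean = np.concatenate([pos, neg], axis=0).mean(axis=0)
    between = (len(pos) * np.sum((pos.mean(axis=0) - grand_mean) ** 2)
               + len(neg) * np.sum((neg.mean(axis=0) - grand_mean) ** 2))
    within = (np.sum((pos - pos.mean(axis=0)) ** 2)
              + np.sum((neg - neg.mean(axis=0)) ** 2))
    return between / within


# Synthetic stand-in for per-layer activations, shape (n_examples, hidden_dim).
rng = np.random.default_rng(0)
hidden_dim = 16
separation_by_layer = [0.0, 0.5, 2.0, 1.0]  # hypothetical class separation per layer

layers = []
for sep in separation_by_layer:
    shift = np.zeros(hidden_dim)
    shift[0] = sep  # classes differ only along the first dimension
    pos = rng.normal(size=(200, hidden_dim)) + shift
    neg = rng.normal(size=(200, hidden_dim)) - shift
    layers.append((pos, neg))

# One unit-norm concept vector per layer, and one separability score per layer.
vectors = [concept_vector(pos, neg) for pos, neg in layers]
ratios = [variance_ratio(pos, neg) for pos, neg in layers]

# Pick the layer whose classes are most separable under this criterion.
best_layer = int(np.argmax(ratios))
```

In real use, `pos` and `neg` would be activations collected from a model under the two contrastive conditions; the synthetic Gaussians here only stand in for them.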
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The sequential, continuous order of text, often challenged by diagrammatic branching.
- Semantic domain for linear transformations; denotation as actual linear function; Category instance generated from homomorphism principle.
- The idea that features are encoded as directions in activation space.
- Correlative technique measuring the type of information encoded in distributed representations via linear predictability.
- A straight vector in activation space, traditionally used for concept manipulation; claimed to be insufficient when true concept geometry is curved.
- Typical approach that adds a scaled steering vector to representations; the paper argues this is mismatched with actual representation geometry.
- Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication
- Probe method combining causal interventions and structural analysis, supported by pyvene's activation collection
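
One of the entries above contrasts standard linear probing with mass-mean probing. As a rough illustration, the sketch below fits both kinds of probe on synthetic two-class data: a logistic-regression probe trained by plain gradient descent, and a mass-mean probe whose direction is assumed here to be the difference of class means with a midpoint threshold. The data, hyperparameters, and function names are hypothetical and not taken from any of the listed papers.

```python
import numpy as np


def fit_logistic_probe(x, y, lr=0.1, steps=500):
    """Fit weights w and bias b by gradient descent on the mean logistic loss."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(y = 1)
        grad = p - y                             # d(loss)/d(logit) per example
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def fit_mass_mean_probe(x, y):
    """Direction is the difference of class means; threshold at their midpoint."""
    mu_pos = x[y == 1].mean(axis=0)
    mu_neg = x[y == 0].mean(axis=0)
    w = mu_pos - mu_neg
    b = -w @ (mu_pos + mu_neg) / 2.0
    return w, b


def accuracy(w, b, x, y):
    """Fraction of examples whose predicted class (logit > 0) matches the label."""
    return float(np.mean((x @ w + b > 0) == (y == 1)))


# Synthetic "activations": two Gaussian classes separated along one dimension.
rng = np.random.default_rng(0)
n, dim = 400, 8
y = rng.integers(0, 2, size=n)
direction = np.zeros(dim)
direction[0] = 1.5
x = rng.normal(size=(n, dim)) + np.outer(2 * y - 1, direction)

x_train, y_train = x[:300], y[:300]
x_test, y_test = x[300:], y[300:]

w_lr, b_lr = fit_logistic_probe(x_train, y_train)
w_mm, b_mm = fit_mass_mean_probe(x_train, y_train)

acc_lr = accuracy(w_lr, b_lr, x_test, y_test)
acc_mm = accuracy(w_mm, b_mm, x_test, y_test)

# Cosine similarity between the two probe directions.
cosine = float(w_lr @ w_mm / (np.linalg.norm(w_lr) * np.linalg.norm(w_mm)))
```

On isotropic data like this, the two probes find nearly the same direction. They can diverge when the activation covariance is far from isotropic, which is where comparing them on classification accuracy becomes informative.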