method
active
method:linear-probe

Linear Probe

Simple linear classifiers trained on model activations, used as the probing technique within the introduced method.
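
As an illustration of the description above, here is a minimal sketch of a linear probe: a logistic-regression classifier fit on activation vectors by plain gradient descent. Everything in it is an assumption made for the example rather than any cited paper's implementation. The function names (`train_linear_probe`, `probe_accuracy`) and the hyperparameters are hypothetical, and the "activations" are synthetic Gaussian data with a planted direction. Real use would substitute activations collected from a model.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe (w, b) on activations.

    acts:   (n_examples, d_model) array of activation vectors
    labels: (n_examples,) array of 0/1 targets
    Hyperparameters are illustrative defaults, not tuned values.
    """
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = acts @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
        # Gradient of mean cross-entropy plus an L2 penalty on w
        grad_w = acts.T @ (probs - labels) / n + l2 * w
        grad_b = np.mean(probs - labels)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def probe_accuracy(acts, labels, w, b):
    """Fraction of examples the probe classifies correctly."""
    preds = (acts @ w + b) > 0
    return float(np.mean(preds == labels.astype(bool)))

# Synthetic stand-in for real activations (assumption): the label shifts
# each vector along one planted unit direction, plus isotropic noise.
rng = np.random.default_rng(0)
d_model = 16
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=400).astype(float)
acts = rng.normal(size=(400, d_model)) + 2.0 * np.outer(2.0 * labels - 1.0, direction)

# Train on the first 300 examples, evaluate on the held-out 100.
w, b = train_linear_probe(acts[:300], labels[:300])
acc = probe_accuracy(acts[300:], labels[300:], w, b)
cosine = float(w @ direction / np.linalg.norm(w))
```

Held-out accuracy is the usual readout for such a probe. In this toy setup, the cosine between `w` and the planted direction also indicates whether the probe recovered the feature.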

Neighborhood — ranked by edge-count

Thinkers (1)

thinker
  • Guillaume Alain
    introduces
    Introduced linear probes as thermometers for neural representations; foundational work cited for probe methodology

Concepts (5)

concept
  • Residual Stream
    associated_with
    Proposed pathway flowing through layers at each position; calculates K/V values that feed horizontal information flow.
  • Truth Direction
    implements
    A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements
  • Directions in activation space associated with contrastive emotive concept pairs, studied in this paper as targets for introspection
  • Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023
  • Probes
    implements
    Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.

Methods (7)

method
  • Linear Probing
    related_to
    Used to evaluate representation quality across VTAB tasks
  • Method for fitting a linear classifier on collected activations to predict task-relevant features
  • Nguyen et al. trained linear probes on activations to distinguish evaluation from deployment scenarios.
  • Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
  • Probe construction method: the concept vector at each layer is the L2-normalized difference between the mean positive and mean negative representations obtained from contrastive system prompts
  • Kim et al. 2018 method for identifying concept directions in CNN activations; precursor to LLM probing
  • Prior-work method for selecting the optimal layer for truth probing by maximizing class separability.
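
The contrastive probe construction listed above (a per-layer concept vector equal to the L2-normalized difference between mean positive and mean negative representations) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the cited paper's code. The function names (`concept_vectors`, `project`), the array layout `(n_examples, n_layers, d_model)`, and the synthetic data are all hypothetical.

```python
import numpy as np

def concept_vectors(pos, neg, eps=1e-12):
    """Per-layer difference-of-means concept vectors, L2-normalized.

    pos, neg: arrays of shape (n_examples, n_layers, d_model) holding
    representations from the positive and negative prompts (assumed layout).
    Returns an array of shape (n_layers, d_model) with unit-norm rows.
    """
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    norms = np.linalg.norm(diff, axis=-1, keepdims=True)
    return diff / np.maximum(norms, eps)  # eps guards against a zero difference

def project(acts, vectors):
    """Score activations (n, n_layers, d_model) against each layer's vector."""
    return np.einsum("nld,ld->nl", acts, vectors)

# Synthetic stand-in data (assumption): the two classes differ by a shift
# along the first coordinate at every layer.
rng = np.random.default_rng(0)
shift = np.zeros(8)
shift[0] = 1.0
pos = rng.normal(size=(50, 4, 8)) + shift
neg = rng.normal(size=(60, 4, 8)) - shift

vectors = concept_vectors(pos, neg)
pos_scores = project(pos, vectors).mean(axis=0)  # mean score per layer
neg_scores = project(neg, vectors).mean(axis=0)
```

Unlike a trained classifier, this construction needs no optimization. Each layer's vector is determined directly by the two class means.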

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • linearity · concept · 0.775
    The sequential, continuous order of text, often challenged by diagrammatic branching.
  • Linear Map (a ⊸ b) · framework · 0.771
    Semantic domain for linear transformations; denotation as actual linear function; Category instance generated from homomorphism principle.
  • The idea that features are encoded as directions in activation space.
  • Linear Decoding · method · 0.756
    Correlative technique measuring the type of information encoded in distributed representations via linear predictability.
  • linear direction · concept · 0.751
    A straight vector in activation space, traditionally used for concept manipulation; claimed to be insufficient when true concept geometry is curved.
  • linear steering · method · 0.749
    Typical approach that adds a scaled steering vector to representations; the paper argues this is mismatched with actual representation geometry.
  • Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication
  • Probe method combining causal interventions and structural analysis, supported by pyvene's activation collection