Linear Probe

Simple linear classifiers trained on a model's activations, used as the probing technique within the introduced method.
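As a rough illustration of what such a probe involves, the sketch below fits a logistic-regression classifier on synthetic stand-ins for activations. It is not the setup of any paper listed here: the data generator `fake_activation`, the helper names, the dimensionality, and the hyperparameters are all assumptions chosen to keep the example self-contained.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_probe(X, y, lr=0.1, epochs=200):
    """Fit a logistic-regression probe (weights w, bias b) by batch gradient descent."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(dot(w, xi) + b)))  # predicted P(label = 1)
            err = p - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, X, y):
    """Fraction of examples whose label matches the sign of the probe's logit."""
    hits = sum(int((dot(w, xi) + b > 0) == (yi == 1)) for xi, yi in zip(X, y))
    return hits / len(X)

# Hypothetical data: the feature of interest is encoded along coordinate 0,
# standing in for activations collected from a real model.
rng = random.Random(0)

def fake_activation(label, dim=8):
    shift = 2.0 if label == 1 else -2.0
    return [rng.gauss(shift if j == 0 else 0.0, 1.0) for j in range(dim)]

labels = [i % 2 for i in range(200)]
activations = [fake_activation(label) for label in labels]

# Hold out the last 50 examples so accuracy is measured on unseen activations.
w, b = train_linear_probe(activations[:150], labels[:150])
accuracy = probe_accuracy(w, b, activations[150:], labels[150:])
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out accuracy shows only that the feature is linearly decodable from the activations. It does not show that the model uses that direction.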

Neighborhood — ranked by edge-count

paper

thinker

Guillaume Alain
introduces
Introduced linear probes as thermometers for neural representations; foundational work cited for probe methodology

concept

Residual Stream
associated_with
Proposed pathway flowing through layers at each position; calculates K/V values that feed horizontal information flow.
Truth Direction
implements
A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements
Emotive states in LLMs
studies
Directions in activation space associated with contrastive emotive concept pairs studied in this paper as targets for introspection
Truth direction in LLMs
implements
Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023
Probes
implements
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.

method

Linear Probing
related_to
Used to evaluate representation quality across VTAB tasks
Linear Probe Training
related_to
Method for fitting a linear classifier on collected activations to predict task-relevant features
Linear Probe for Evaluation Awareness
related_to
Nguyen et al. trained linear probes on activations to distinguish evaluation from deployment scenarios.
Probe-Based Data Attribution
cites
Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
Contrastive mean-difference probe
extends
Probe construction method: the concept vector at each layer is the L2-normalized difference between the mean positive and mean negative representations obtained from contrastive system prompts.
Concept Activation Vectors (TCAVs)
extends
Kim et al. 2018 method for identifying concept directions in CNN activations; precursor to LLM probing
Between-to-within-class variance ratio
associated_with
Prior-work method for selecting the optimal layer for truth probing by maximizing class separability.
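Two of the methods listed above, the contrastive mean-difference probe and layer selection by between-to-within-class variance ratio, can be sketched together. This is a minimal illustration under stated assumptions. The per-layer representations are synthetic (`fake_reps` is a hypothetical generator). The ratio is computed in one common scalar form, total between-class scatter divided by total within-class scatter, which may differ from the exact formula in the cited prior work.

```python
import math
import random

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def mean_difference_direction(pos, neg):
    """L2-normalized difference between mean positive and mean negative representations."""
    diff = [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]
    norm = math.sqrt(sum(v * v for v in diff))
    return [v / norm for v in diff]

def scatter(rows, center):
    """Sum of squared deviations of all rows from a center vector."""
    return sum((x - c) ** 2 for row in rows for x, c in zip(row, center))

def variance_ratio(pos, neg):
    """Between-class scatter over within-class scatter (assumed scalar form)."""
    mu_p, mu_n = mean_vector(pos), mean_vector(neg)
    mu = mean_vector(pos + neg)
    between = (len(pos) * sum((a - m) ** 2 for a, m in zip(mu_p, mu))
               + len(neg) * sum((a - m) ** 2 for a, m in zip(mu_n, mu)))
    within = scatter(pos, mu_p) + scatter(neg, mu_n)
    return between / within

# Hypothetical representations: class separation grows with depth, along coordinate 0.
rng = random.Random(0)

def fake_reps(shift, n=100, dim=6):
    return [[rng.gauss(shift if j == 0 else 0.0, 1.0) for j in range(dim)]
            for _ in range(n)]

layer_shifts = [0.2, 1.0, 3.0]  # one assumed separation per layer
layers = [(fake_reps(+s), fake_reps(-s)) for s in layer_shifts]

# Pick the layer with the highest ratio, then build the concept vector there.
ratios = [variance_ratio(pos, neg) for pos, neg in layers]
best_layer = max(range(len(layers)), key=lambda i: ratios[i])
direction = mean_difference_direction(*layers[best_layer])
print(f"best layer: {best_layer}, ratios: {[round(r, 3) for r in ratios]}")
```

Unlike a trained classifier, the mean-difference direction has no learned parameters, so it can be computed from a small set of contrastive examples.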

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

linearity · concept · 0.775
The sequential, continuous order of text, often challenged by diagrammatic branching.
Linear Map (a ⊸ b) · framework · 0.771
Semantic domain for linear transformations; denotation as actual linear function; Category instance generated from homomorphism principle.
Linear representation · concept · 0.760
The idea that features are encoded as directions in activation space.
Linear Decoding · method · 0.756
Correlative technique measuring the type of information encoded in distributed representations via linear predictability.
linear direction · concept · 0.751
A straight vector in activation space, traditionally used for concept manipulation; claimed to be insufficient when true concept geometry is curved.
linear steering · method · 0.749
Typical approach that adds a scaled steering vector to representations; the paper argues this is mismatched with actual representation geometry.
Logistic Regression Probe · method · 0.744
Standard linear probing technique; compared to mass-mean probing on classification accuracy and causal implication.
Causal Structural Probe · method · 0.742
Probe method combining causal interventions and structural analysis, supported by pyvene's activation collection