Linear Probing

Used to evaluate representation quality across VTAB tasks

Neighborhood — ranked by edge-count

framework

Linear Representation Hypothesis
implements
The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
CausalGym
uses
Multi-task benchmark of linguistic behaviours for measuring causal efficacy of interpretability methods, adapted from SyntaxGym

method

Linear Probe
related_to
Simple linear classifiers trained on model activations used as the probing technique within the introduced method.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Linear Probe Training · method · 0.847
Method for fitting a linear classifier on collected activations to predict task-relevant features
Linear Probe for Evaluation Awareness · method · 0.836
Nguyen et al. trained linear probes on activations to distinguish evaluation from deployment scenarios.
Probing Methods · method · 0.796
Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach
Diagnostic Probing · method · 0.788
Earlier interpretability method applying classifiers to DNN hidden representations; shares complexity-accuracy dilemma with causal abstraction
Linear Decoding · method · 0.786
Correlative technique measuring the type of information encoded in distributed representations via linear predictability.
linearity · concept · 0.779
The sequential, continuous order of text, often challenged by diagrammatic branching.
Probes · concept · 0.765
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
Linear representation · concept · 0.765
The idea that features are encoded as directions in activation space.
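
Several entries above describe the same basic recipe: collect activations, fit a linear classifier on them, and read the learned weights as a direction in activation space. The following is a minimal sketch of that recipe, not any cited paper's implementation. It assumes synthetic "activations" (Gaussian noise shifted along a made-up concept direction) in place of real model activations, and it uses a hand-written logistic regression so that it runs with the standard library only. All function names, the direction vector, and the hyperparameters are hypothetical.

```python
import math
import random


def make_activations(n, direction, rng):
    """Synthetic stand-in for model activations (an assumption, not real data):
    unit Gaussian noise shifted by +direction for label 1 and -direction for label 0."""
    xs, ys = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        xs.append([rng.gauss(0.0, 1.0) + sign * d for d in direction])
        ys.append(label)
    return xs, ys


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(xs, ys, epochs=200, lr=0.1):
    """Fit a logistic-regression probe (weights w, bias b) by full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, xs, ys):
    """Fraction of examples whose label matches the sign of the probe's logit."""
    hits = 0
    for x, y in zip(xs, ys):
        logit = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((logit > 0) == (y == 1))
    return hits / len(xs)


rng = random.Random(0)
direction = [1.5, -1.0, 0.0, 0.5]  # hypothetical concept direction
train_x, train_y = make_activations(400, direction, rng)
test_x, test_y = make_activations(200, direction, rng)

w, b = train_linear_probe(train_x, train_y)
acc = probe_accuracy(w, b, test_x, test_y)
print("held-out probe accuracy:", acc)
print("learned probe weights:", w)
```

High held-out accuracy here only shows that the label is linearly decodable from these synthetic vectors. As the Linear Decoding entry notes, this kind of evidence is correlative: it does not by itself show that a model uses the direction.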