method
active
method:linear-probing
Linear Probing
Used to evaluate representation quality across VTAB tasks
Neighborhood — ranked by edge-count
Frameworks (2)
framework
- Linear Representation Hypothesis · implements · The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
- CausalGym · uses · Multi-task benchmark of linguistic behaviours for measuring causal efficacy of interpretability methods, adapted from SyntaxGym
Methods (1)
method
- Linear Probe · related_to · Simple linear classifiers trained on model activations, used as the probing technique within the introduced method.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Method for fitting a linear classifier on collected activations to predict task-relevant features
- Nguyen et al. trained linear probes on activations to distinguish evaluation from deployment scenarios.
- Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach
- Earlier interpretability method applying classifiers to DNN hidden representations; shares complexity-accuracy dilemma with causal abstraction
- Correlative technique measuring the type of information encoded in distributed representations via linear predictability.
- The sequential, continuous order of text, often challenged by diagrammatic branching.
- Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
- The idea that features are encoded as directions in activation space.
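
Several entries above describe linear probing as fitting a linear classifier on collected activations to predict a task-relevant feature. A minimal sketch of that idea follows, using only the Python standard library. Everything in it is an assumption made for illustration: the "activations" are synthetic vectors in which a binary label is encoded along one fixed direction plus noise, and the helper names (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`) are hypothetical, not taken from any of the sources listed here. In practice one would more likely fit an off-the-shelf logistic regression on activations extracted from a real model.

```python
import math
import random


def make_synthetic_activations(n, dim, seed=0):
    """Hypothetical stand-in for collected activations.

    Each vector is +direction or -direction (depending on the label)
    plus unit Gaussian noise, so the label is linearly decodable.
    """
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        xs.append([sign * d + rng.gauss(0, 1) for d in direction])
        ys.append(y)
    return xs, ys


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(xs, ys, lr=0.1, epochs=100):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, xs, ys):
    """Fraction of examples whose label the probe predicts correctly."""
    correct = 0
    for x, y in zip(xs, ys):
        predicted = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(predicted == y)
    return correct / len(xs)


# Train on one split and report accuracy on held-out activations.
xs, ys = make_synthetic_activations(n=600, dim=8, seed=0)
train_x, train_y = xs[:400], ys[:400]
test_x, test_y = xs[400:], ys[400:]
w, b = train_linear_probe(train_x, train_y)
accuracy = probe_accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out accuracy here only shows that the label is linearly predictable from the vectors. As one of the entries above notes, this is a correlative measurement: it does not by itself show that a model uses the decoded direction.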