method
active
method:sparse-probing

Sparse Probing

Method from Gurnee et al. (2023) for finding feature directions, including analysis at the level of individual neurons.
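
To make the description concrete, here is a minimal sketch of the general idea behind a k-sparse probe: rank neurons by how well each separates a binary feature, keep only the top k, and fit a linear classifier on those k alone. Everything in it is an illustrative assumption rather than the procedure from Gurnee et al.: the synthetic activations, the choice of class-mean difference as the ranking score, the hand-rolled logistic regression, and all function and variable names are hypothetical.

```python
import numpy as np

def select_top_k_neurons(acts, labels, k):
    """Rank neurons by |mean activation on positives - mean on negatives|; keep the top k."""
    mean_pos = acts[labels == 1].mean(axis=0)
    mean_neg = acts[labels == 0].mean(axis=0)
    scores = np.abs(mean_pos - mean_neg)
    return np.argsort(scores)[::-1][:k]

def fit_logistic_probe(x, y, lr=0.5, steps=500):
    """Plain gradient-descent logistic regression; returns (weights, bias)."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(x, y, w, b):
    """Fraction of examples the linear probe classifies correctly."""
    preds = ((x @ w + b) > 0).astype(int)
    return float((preds == y).mean())

# Synthetic stand-in for model activations (assumption: 64 "neurons",
# of which only neurons 3 and 17 carry the binary feature).
rng = np.random.default_rng(0)
n_samples, n_neurons = 2000, 64
labels = rng.integers(0, 2, size=n_samples)
acts = rng.normal(size=(n_samples, n_neurons))
signal_neurons = [3, 17]
acts[:, signal_neurons] += 2.0 * labels[:, None]

# Hold out the last 500 examples for evaluation.
train_x, test_x = acts[:1500], acts[1500:]
train_y, test_y = labels[:1500], labels[1500:]

k = 2
selected = select_top_k_neurons(train_x, train_y, k)
w, b = fit_logistic_probe(train_x[:, selected], train_y)
accuracy = probe_accuracy(test_x[:, selected], test_y, w, b)

print("selected neurons:", sorted(selected.tolist()))
print("held-out accuracy with k =", k, ":", round(accuracy, 3))
```

Sweeping k from 1 upward and watching where held-out accuracy saturates is one way to gauge how many neurons a feature is spread across; with k = 1 the probe reduces to a single-neuron analysis.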

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Probing Methods · method · 0.809
    Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach
  • Coding scheme where qualities are represented by few neurons with continuous similarity relations.
  • Probes · concept · 0.794
    Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
  • Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff
  • Activation Probing · concept · 0.782
    Technique of reading out model beliefs from internal activations before the final answer token is generated
  • Earlier interpretability method applying classifiers to DNN hidden representations; shares complexity-accuracy dilemma with causal abstraction
  • The extracted set of sparse interpretable features from model embeddings via SAEs
  • Sparse circuits · concept · 0.754
    A goal in mechanistic interpretability to identify sparse computational subgraphs; VPD promotes sparse parameter circuits.