concept
active
concept:probes

Probes

Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.

Neighborhood — ranked by edge-count

Methods (2)

method
  • Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
  • Linear Probe
    implements
    Simple linear classifiers trained on model activations used as the probing technique within the introduced method.
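
A minimal sketch of the linear-probe idea described above: a logistic-regression classifier fit on activation vectors. It is illustrative only and not the source method's implementation. The synthetic "activations", the function names `train_linear_probe` and `probe_predict`, and all hyperparameters are assumptions made for this example.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = acts @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels          # gradient of cross-entropy w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_predict(acts, w, b):
    """Predict class 1 where the probe logit is positive."""
    return (acts @ w + b > 0).astype(int)

# Synthetic stand-in for model activations (assumption): two classes that
# differ along one hidden direction, plus unit Gaussian noise.
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=400)
acts = rng.normal(size=(400, d)) + 3.0 * np.outer(2 * labels - 1, direction)

# Train on the first 300 examples, evaluate on the held-out 100.
w, b = train_linear_probe(acts[:300], labels[:300])
accuracy = (probe_predict(acts[300:], w, b) == labels[300:]).mean()
```

With real activations, `acts` would be hidden states collected from a chosen layer, and the probe logit `acts @ w + b` could serve as a per-example score.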

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Earlier interpretability method applying classifiers to DNN hidden representations; shares complexity-accuracy dilemma with causal abstraction
  • Probing Methods · method · 0.826
    Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach
  • Probe score · concept · 0.820
    Dot product between hidden state and concept vector averaged across 5-layer window around best layer; measures model's internal emotive state
  • Activation Probing · concept · 0.799
    Technique of reading out model beliefs from internal activations before the final answer token is generated
  • Sparse Probing · method · 0.794
    Method from Gurnee et al. 2023 for finding feature directions including individual neuron analysis
  • Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff
  • The ability of probes trained on one dataset to transfer accurately to topically and structurally different datasets
  • One of four emotive concept probes trained; contrastive pair distracted/focused with best layer 10 in LLaMA-3.2-3B
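
The probe-score entry in the list above (dot product between hidden state and concept vector, averaged across a 5-layer window around the best layer) can be sketched as follows. This is a hedged illustration: the function name `probe_score`, the use of one concept vector per layer, the centered window, and the clipping at the first and last layer are all assumptions not stated in the source.

```python
import numpy as np

def probe_score(hidden_states, concept_vectors, best_layer, window=5):
    """Mean dot product over a window of layers centered on best_layer.

    hidden_states:   array of shape (num_layers, d), one hidden state per layer
    concept_vectors: array of shape (num_layers, d), one concept vector per layer
                     (assumption; the source does not say whether it is per layer)
    """
    half = window // 2
    lo = max(0, best_layer - half)                           # clip at the first layer
    hi = min(hidden_states.shape[0], best_layer + half + 1)  # clip at the last layer
    dots = np.einsum("ld,ld->l", hidden_states[lo:hi], concept_vectors[lo:hi])
    return float(dots.mean())

# Toy inputs (hypothetical sizes): layer i has hidden state i * ones, and the
# concept vector is ones / d, so each per-layer dot product equals i.
num_layers, d = 28, 8
hidden = np.arange(num_layers, dtype=float)[:, None] * np.ones((1, d))
concept = np.ones((num_layers, d)) / d

score = probe_score(hidden, concept, best_layer=10)
```

Here the window covers layers 8 through 12, so the toy score is the mean of those layer indices.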