method
active
method:pinned-feature-sampling

Pinned Feature Sampling

Pinning a feature's activation to its maximum observed value, then sampling from the model, to validate causal interpretations of that feature
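
The idea can be illustrated with a minimal sketch, assuming a toy setting rather than any particular model or library. The hidden state is a plain vector, the feature is a single direction in that space, and the "model" is one linear unembedding followed by a softmax. All names here (`pin_feature`, `sample_pinned`, `unembed`, `observed`) are hypothetical, and treating pinning as "set the projection onto the feature direction" is an assumption; a sparse-autoencoder setup would more likely clamp the latent activation before decoding.

```python
import numpy as np

def pin_feature(hidden, direction, target):
    """Return `hidden` with its component along `direction` set to `target`."""
    unit = direction / np.linalg.norm(direction)
    current = hidden @ unit                      # current activation of the feature
    return hidden + (target - current) * unit    # shift only along the feature direction

def sample_pinned(hidden, direction, target, unembed, n_samples, rng):
    """Pin the feature, then sample token ids from the resulting softmax distribution."""
    logits = unembed @ pin_feature(hidden, direction, target)
    probs = np.exp(logits - logits.max())        # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), size=n_samples, p=probs)

rng = np.random.default_rng(0)
d_model, vocab = 8, 5
direction = rng.normal(size=d_model)             # hypothetical feature direction
unembed = rng.normal(size=(vocab, d_model))      # toy stand-in for the rest of the model
observed = rng.normal(size=(100, d_model))       # stand-in for activations recorded on a dataset

unit = direction / np.linalg.norm(direction)
max_observed = (observed @ unit).max()           # maximum observed activation of the feature

hidden = observed[0]
pinned = pin_feature(hidden, direction, max_observed)
samples = sample_pinned(hidden, direction, max_observed, unembed, 20, rng)
```

Comparing `samples` against samples drawn from the unpinned `hidden` is what would indicate whether the feature has the causal effect its interpretation predicts; the sketch only shows the pinning and sampling steps.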

Neighborhood — ranked by edge-count

Claims (1)

claim

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • The mechanism by which LLMs generate text: drawing a token from the next-token distribution and appending it to context repeatedly
  • A technique for filtering model outputs; mentioned in connection with a Redwood Research project.
  • Method of optimizing input to cause a neuron to fire maximally, used to characterize what a neuron detects; establishes causal link
  • Feature Looping · concept · 0.705
    Repetitive behavioral pattern observed under high steering strengths in SAE feature self-steering experiments
  • Temperature=0.8 sampled decoding for self-report; reduces collapse moderately but remains discrete and noisy
  • Examining downstream neurons that rely on a given feature to verify its functional role
  • Feature Sparsity · concept · 0.698
    Property that features activate on only a small fraction of inputs; enables compressed sensing and is what allows superposition to work
  • Dividing feature activation spectrum into 11 evenly-spaced intervals and sampling uniformly to evaluate monosemanticity across activation levels