Pinned Feature Sampling

Setting a feature's value to its maximum observed value and sampling from the model to validate causal interpretations
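
A minimal sketch of the idea, assuming the feature is a unit-norm linear direction in a hidden state and that "maximum observed value" means the largest activation over a reference set of contexts. The toy model and every name in it (`embed`, `unembed`, `feature_dir`, `pin_feature`, `sample`) are hypothetical and used only for illustration. The source does not specify how the feature is pinned.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL = 6, 4

# Hypothetical toy model: random embedding and unembedding matrices.
embed = rng.normal(size=(VOCAB, D_MODEL))
unembed = rng.normal(size=(D_MODEL, VOCAB))

# Hypothetical feature, assumed to be a unit-norm direction in hidden space.
feature_dir = rng.normal(size=D_MODEL)
feature_dir /= np.linalg.norm(feature_dir)


def hidden_state(context):
    """Toy hidden state: mean of the embeddings of the context tokens."""
    return embed[context].mean(axis=0)


def feature_activation(hidden):
    """Activation of the feature: projection onto its direction."""
    return float(hidden @ feature_dir)


def pin_feature(hidden, value):
    """Overwrite the component along feature_dir so the activation equals value."""
    return hidden + (value - feature_activation(hidden)) * feature_dir


def sample(context, n_tokens, pinned_value=None, seed=0):
    """Autoregressively sample tokens, optionally pinning the feature each step."""
    local_rng = np.random.default_rng(seed)
    context = list(context)
    for _ in range(n_tokens):
        h = hidden_state(context)
        if pinned_value is not None:
            h = pin_feature(h, pinned_value)
        logits = h @ unembed
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        context.append(int(local_rng.choice(VOCAB, p=probs)))
    return context


# "Maximum observed value": largest activation over a small reference set.
reference_contexts = [[0, 1], [2, 3, 4], [5], [1, 1, 2]]
max_observed = max(feature_activation(hidden_state(c)) for c in reference_contexts)

baseline = sample([0], n_tokens=10)
pinned = sample([0], n_tokens=10, pinned_value=max_observed)
```

Comparing `baseline` with `pinned` over many seeds is the kind of check the definition describes: if the pinned samples shift toward what the feature is believed to represent, that supports the causal reading.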

Neighborhood — ranked by edge-count

claim

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Autoregressive Sampling · method · 0.720
The mechanism by which LLMs generate text: drawing a token from the next-token distribution and appending it to context repeatedly
Rejection sampling · method · 0.713
A technique for filtering model outputs; mentioned in connection with a Redwood Research project.
Feature Visualization · method · 0.707
Method of optimizing input to cause a neuron to fire maximally, used to characterize what a neuron detects; establishes causal link
Feature Looping · concept · 0.705
Repetitive behavioral pattern observed under high steering strengths in SAE feature self-steering experiments
Sampled-decoding self-report · method · 0.704
Temperature=0.8 sampled decoding for self-report; reduces collapse moderately but remains discrete and noisy
Downstream Client Feature Analysis · method · 0.701
Examining downstream neurons that rely on a given feature to verify its functional role
Feature Sparsity · concept · 0.698
Property that features activate on only a small fraction of inputs; enables compressed sensing and is what allows superposition to work
Activation Interval Sampling · method · 0.695
Dividing feature activation spectrum into 11 evenly-spaced intervals and sampling uniformly to evaluate monosemanticity across activation levels