Activation Interval Sampling

Dividing a feature's activation spectrum into 11 evenly spaced intervals and sampling uniformly to evaluate monosemanticity across activation levels, not only at the strongest activations.
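The procedure described above can be sketched in a few lines. This is a minimal illustration, not the original implementation. The function name `sample_by_activation_interval`, the `per_interval` count, and the `seed` parameter are hypothetical. It also assumes that the spectrum runs from the smallest to the largest observed activation and that "sampling uniformly" means drawing the same number of examples, without replacement, from each interval.

```python
import random


def sample_by_activation_interval(activations, n_intervals=11, per_interval=5, seed=0):
    """Bucket examples by activation level and sample evenly from each bucket.

    Returns a dict mapping interval index (0 = weakest) to a list of
    example indices into `activations`.
    """
    rng = random.Random(seed)
    lo, hi = min(activations), max(activations)
    # Assumption: the spectrum spans the observed min..max activation.
    width = (hi - lo) / n_intervals

    buckets = {k: [] for k in range(n_intervals)}
    for idx, a in enumerate(activations):
        if width == 0:
            k = 0  # all activations identical: a single occupied interval
        else:
            # Clamp so the maximum activation falls in the last interval.
            k = min(int((a - lo) / width), n_intervals - 1)
        buckets[k].append(idx)

    # Assumption: up to `per_interval` examples per interval, drawn without
    # replacement; sparse intervals contribute whatever they contain.
    return {
        k: rng.sample(members, min(per_interval, len(members)))
        for k, members in buckets.items()
    }


if __name__ == "__main__":
    # Toy activations for one feature; real values would come from the model.
    toy = [i / 100 for i in range(101)]
    for interval, picked in sample_by_activation_interval(toy).items():
        print(interval, [toy[i] for i in picked])
```

Sampling per interval, rather than from the pooled examples, keeps the weakly activating examples from being crowded out by whichever activation range happens to be most populated.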

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Autoregressive Sampling · method · 0.767
The mechanism by which LLMs generate text: drawing a token from the next-token distribution and appending it to context repeatedly
Activations · concept · 0.759
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Statistical Activation Analysis · method · 0.749
Component of the contrastive retrieval pipeline analyzing activation statistics.
Rejection sampling · method · 0.749
A technique for filtering model outputs; a Redwood Research project is mentioned.
Activation Capping · method · 0.735
Clamping activations along the Assistant Axis to remain above a minimum threshold (25th percentile), introduced as a stabilization method
Thompson Sampling · method · 0.731
A Bayesian exploration strategy that samples from the posterior distribution over model parameters to decide actions.
Activation Probing · concept · 0.727
Technique of reading out model beliefs from internal activations before the final answer token is generated
Threshold-like activation assumption · concept · 0.725
Assumption that small anchor changes can produce sharp performance shifts when conditions are favorable.