Rejection sampling

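A technique for filtering model outputs: candidates are sampled, and those that fail an acceptance check are discarded. Mentioned in connection with a Redwood Research project.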
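
As a rough illustration of the filtering idea, the sketch below draws candidates until one passes an acceptance check. The names `rejection_sample`, `toy_generate`, and `toy_accept` are hypothetical and stand in for a real model and a real filter; nothing here is taken from the Redwood Research project itself.

```python
import random
from typing import Callable, Optional


def rejection_sample(
    generate: Callable[[], str],
    accept: Callable[[str], bool],
    max_tries: int = 100,
) -> Optional[str]:
    """Draw candidates until one passes `accept`, or give up after `max_tries`."""
    for _ in range(max_tries):
        candidate = generate()
        if accept(candidate):
            return candidate
    return None  # every candidate was rejected


# Toy stand-ins (assumptions for illustration only).
rng = random.Random(0)
CANDIDATES = ["safe reply A", "unsafe reply", "safe reply B"]


def toy_generate() -> str:
    # Stand-in for sampling one output from a model.
    return rng.choice(CANDIDATES)


def toy_accept(text: str) -> bool:
    # Stand-in for a classifier or rule that flags unwanted outputs.
    return not text.startswith("unsafe")


result = rejection_sample(toy_generate, toy_accept)
```

In practice the cost grows with the rejection rate, so a cap such as `max_tries` is a common safeguard against looping indefinitely.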

Neighborhood — ranked by edge-count

artifact

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Autoregressive Sampling · method · 0.773
The mechanism by which LLMs generate text: drawing a token from the next-token distribution and appending it to the context, repeatedly.
Thompson Sampling · method · 0.772
A Bayesian exploration strategy that samples from the posterior distribution over model parameters to decide actions.
Sampled-decoding self-report · method · 0.751
Temperature=0.8 sampled decoding for self-report; reduces collapse moderately but remains discrete and noisy.
Activation Interval Sampling · method · 0.749
Dividing the feature activation spectrum into 11 evenly spaced intervals and sampling uniformly to evaluate monosemanticity across activation levels.
refusal rate · concept · 0.746
The percentage of harmful requests that a model refuses to answer, a common safety metric.
Monte Carlo Cone Sampling · method · 0.723
Procedure for sampling 64 random nonnegative combinations of cone basis vectors to evaluate the full cone distribution.
Refusal direction · concept · 0.722
Arditi et al. 2024 finding that refusal behavior is mediated by one direction in LLM activations; exemplar of single-direction causal results.
Experience Sampling Method (ESM) · method · 0.721
Human psychology method for repeated in-situ self-report; methodological inspiration for the paper's approach.
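
The Autoregressive Sampling entry above describes a loop: draw a token from the next-token distribution, append it to the context, repeat. A minimal sketch of that loop follows, assuming a toy lookup table in place of a real model; `sample_autoregressive`, `toy_next_token_probs`, and the `<eos>` marker are hypothetical names chosen for illustration.

```python
import random
from typing import Callable, Dict, List


def sample_autoregressive(
    next_token_probs: Callable[[List[str]], Dict[str, float]],
    prompt: List[str],
    max_new_tokens: int,
    rng: random.Random,
    eos: str = "<eos>",
) -> List[str]:
    """Repeatedly sample a token from the next-token distribution and append it."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == eos:
            break  # stop once the end-of-sequence marker is drawn
        context.append(token)
    return context


# Toy next-token table keyed on the last token (an assumption, not a real LLM).
TOY_TABLE = {
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"<eos>": 1.0},
}


def toy_next_token_probs(context: List[str]) -> Dict[str, float]:
    return TOY_TABLE[context[-1]]


out = sample_autoregressive(toy_next_token_probs, ["the"], 10, random.Random(0))
```

A real model would condition on the whole context rather than only the last token; the table lookup is used here only to keep the example self-contained.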