method
active
method:rejection-sampling

Rejection sampling

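A technique for filtering model outputs: candidate outputs are sampled, and any that fail an acceptance check are discarded. Mentioned in connection with a Redwood Research project.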

Neighborhood — ranked by edge-count

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • The mechanism by which LLMs generate text: drawing a token from the next-token distribution and appending it to context repeatedly
  • A Bayesian exploration strategy that samples from the posterior distribution over model parameters to decide actions.
  • Temperature=0.8 sampled decoding for self-report; reduces collapse moderately but remains discrete and noisy
  • Dividing feature activation spectrum into 11 evenly-spaced intervals and sampling uniformly to evaluate monosemanticity across activation levels
  • refusal rate · concept · 0.746
    The percentage of harmful requests that a model refuses to answer, a common safety metric.
  • Procedure for sampling 64 random nonnegative combinations of cone basis vectors to evaluate the full cone distribution
  • Refusal direction · concept · 0.722
    Arditi et al. 2024 finding that refusal behavior is mediated by one direction in LLM activations; exemplar of single-direction causal results
  • Human psychology method for repeated in-situ self-report; methodological inspiration for the paper's approach