refusal rate

The percentage of harmful requests that a model refuses to answer, a common safety metric.
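As a worked illustration of the definition above, the following minimal sketch computes a refusal rate from per-request labels. It assumes each harmful request has already been judged as refused or not refused; the function name `refusal_rate` and the sample labels are hypothetical and are not taken from any particular paper or benchmark.

```python
from typing import Sequence


def refusal_rate(refused: Sequence[bool]) -> float:
    """Return the percentage of harmful requests that were refused.

    `refused[i]` is True when the model refused harmful request i.
    How a refusal is detected (keyword match, classifier, human label)
    is outside this sketch and is assumed to be done upstream.
    """
    if not refused:
        raise ValueError("need at least one harmful request")
    return 100.0 * sum(refused) / len(refused)


# Hypothetical labels for four harmful requests: three refused, one answered.
labels = [True, True, False, True]
print(refusal_rate(labels))
```

Under this binary labeling, the share of harmful requests that were not refused is simply 100 minus the refusal rate.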

Neighborhood — ranked by edge-count

concept

compliance
associated_with
The model's tendency to comply with harmful requests, the opposite of refusal.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
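The selection rule described above can be sketched as follows: keep entities whose embedding has cosine similarity of at least 0.65 with this entity, and skip any entity that already has a typed edge to it. The function names, the dictionary-based data layout, and the toy vectors are assumptions made for illustration; the page does not say how its embeddings or edges are stored.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(
    target: str,
    embeddings: Dict[str, Sequence[float]],
    typed_neighbors: Set[str],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs at or above the threshold, highest first.

    Entities already linked to `target` by a typed edge are excluded.
    """
    results = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy 2-D embeddings (hypothetical values, chosen only to exercise the filter).
toy = {
    "refusal rate": [1.0, 0.0],
    "refusal direction": [1.0, 0.5],  # similar, no typed edge: kept
    "compliance": [1.0, 0.1],         # similar, but already has a typed edge
    "unrelated": [0.0, 1.0],          # below the threshold
}
print(untyped_neighbors("refusal rate", toy, typed_neighbors={"compliance"}))
```

Entities returned by such a filter are only candidates; deciding whether one deserves a new typed edge or is a duplicate still requires review.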

Refusal direction · concept · 0.798
Arditi et al. 2024 finding that refusal behavior is mediated by one direction in LLM activations; exemplar of single-direction causal results
Deceptive Response Rate · method · 0.779
Primary metric measuring the percentage of responses in which a model chooses the deceptive option
Learning Rate · concept · 0.770
Hyperparameter for optimizing model parameters through learning in active inference.
Refusal Vector · concept · 0.752
Single linear direction mediating refusal behavior in LLMs, shown by Arditi et al.; related to but distinct from the Assistant Axis
Rejection sampling · method · 0.746
A technique for filtering model outputs; mentioned alongside a Redwood Research project.
Refusal Direction in LLMs · concept · 0.738
Prior finding that LLM refusal is mediated by a single latent direction, analogous to this paper's reflection direction.
Reflection rate · concept · 0.736
Ratio of reflection steps to total reasoning steps, used to quantify reflection behavior
overbid rate · method · 0.733
Fraction of auctions in which an agent submitted a bid exceeding its total money, triggering wealth revelation penalty