concept
active
concept:refusal-direction-in-llms

Refusal Direction in LLMs

Prior finding (Arditi et al. 2024) that refusal behavior in LLMs is mediated by a single direction in activation space, analogous to this paper's reflection direction.
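
What "mediated by a single direction" can mean operationally is illustrated by the sketch below. It is a minimal illustration on synthetic data, not the cited paper's procedure: it assumes the direction is estimated as the normalized difference between the mean activations of two prompt sets and is removed by projecting it out. The names `mean_difference_direction`, `ablate_direction`, `refusing`, and `complying` are hypothetical, and the random arrays stand in for real model activations.

```python
import numpy as np


def mean_difference_direction(acts_a, acts_b):
    """Unit vector pointing from the mean of acts_b to the mean of acts_a."""
    diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts, direction):
    """Remove the component of each activation vector along a unit direction."""
    return acts - np.outer(acts @ direction, direction)


# Synthetic stand-ins for activations (assumption: no real model is involved).
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[0] = 1.0  # the direction planted in the synthetic "refusing" set

refusing = rng.normal(size=(200, d_model)) + 4.0 * planted
complying = rng.normal(size=(200, d_model))

# Estimate the direction, then project it out of the "refusing" activations.
direction = mean_difference_direction(refusing, complying)
ablated = ablate_direction(refusing, direction)
```

After ablation, every row of `ablated` has zero component along `direction`, which is the sense in which a single vector can be removed from the activations without touching the orthogonal subspace.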

Neighborhood — ranked by edge-count

Concepts (2)

concept
  • Arditi et al. 2024 finding that refusal behavior is mediated by one direction in LLM activations; exemplar of single-direction causal results
  • The paper's central construct: a vector in LLM activation space encoding the transition between reflection levels.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.