Refusal Direction in LLMs

Prior finding that LLM refusal is mediated by a single latent direction, analogous to this paper's reflection direction.

Neighborhood — ranked by edge-count

concept

Refusal direction
related_to
Arditi et al. (2024) finding that refusal behavior is mediated by a single direction in LLM activations; an exemplar of single-direction causal results.
Latent Direction of Reflection
analogous_to
The paper's central construct: a vector in LLM activation space encoding the transition between reflection levels.
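To make "mediated by a single direction" concrete, the sketch below removes the component of each activation vector that lies along one chosen direction. This is only an illustration of the general idea, not the procedure from Arditi et al. or from the paper. The function name `ablate_direction` and the toy arrays are hypothetical, and how a direction is actually estimated is not covered here.

```python
import numpy as np

def ablate_direction(activations, direction):
    """Project a single direction out of a batch of activation vectors.

    activations: array of shape (n, d), one activation vector per row (hypothetical data).
    direction:   array of shape (d,), the candidate direction (need not be unit length).
    Returns an (n, d) array whose rows have zero component along `direction`.
    """
    acts = np.atleast_2d(np.asarray(activations, dtype=float))
    d = np.asarray(direction, dtype=float)
    norm = np.linalg.norm(d)
    if norm == 0.0:
        raise ValueError("direction must be a non-zero vector")
    unit = d / norm
    # Scalar projection of every row onto the unit direction.
    coeffs = acts @ unit
    # Subtract each row's component along the direction.
    return acts - np.outer(coeffs, unit)

# Toy usage with made-up numbers.
toy_acts = np.array([[3.0, 4.0], [1.0, 0.0]])
toy_dir = np.array([2.0, 0.0])
ablated = ablate_direction(toy_acts, toy_dir)
```

If behavior really depends on one direction, an intervention of this kind would be expected to change that behavior while leaving the remaining components of the activations untouched.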

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
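A minimal sketch of how such a candidate list could be produced, assuming each entity has an embedding vector and that entities with an existing typed edge are known by name. The helper names (`cosine_similarity`, `untyped_neighbors`), the dictionary layout, and the toy vectors are hypothetical; only the 0.65 threshold comes from the page.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)

def untyped_neighbors(target, candidates, typed_names, threshold=0.65):
    """Return (name, score) pairs at or above the threshold, highest score first.

    target:      embedding of the entity being viewed (hypothetical).
    candidates:  dict mapping entity name to embedding (hypothetical).
    typed_names: set of names that already have a typed edge to the target.
    """
    results = []
    for name, vec in candidates.items():
        if name in typed_names:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine_similarity(target, vec)
        if score >= threshold:
            results.append((name, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results

# Toy usage with made-up two-dimensional embeddings.
toy_candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [1.0, 0.1]}
neighbors = untyped_neighbors([1.0, 0.0], toy_candidates, typed_names={"d"})
```

Sorting by score puts the most likely duplicates at the top, which is where a reviewer would look first.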

Truth direction in LLMs · concept · 0.832
Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023.
Reflection in LLMs · concept · 0.775
The core phenomenon studied: the ability of LLMs to evaluate and revise their own reasoning.
LLM judge evaluation · method · 0.757
Using Claude Sonnet 4 as a grader to categorize model responses according to predefined criteria.
Truth Direction in LLM Latent Space · concept · 0.757
A specific direction in an LLM's residual stream that encodes the truth or falsehood of factual statements.
Where inside the LLM should we look for an accurate truth direction that will generalize the most across tasks? · question · 0.752
One of the three guiding research questions of the paper.
Where does reliable, goal-directed behavior come from in LLMs if it is not explicitly programmed? · question · 0.749
Opening motivating question that UCCT attempts to answer through semantic anchoring.
Hallucination in LLMs · concept · 0.745
Problem cited as a shortcoming of current LLMs; PRH predicts hallucinations should decrease with scale.
Non-Linear Representations in LLMs · concept · 0.744
Recent work identifying cases where LLM features are not one-dimensionally linear, a caveat to the linearity hypothesis.