concept
active
concept:refusal-direction-in-llms
Refusal Direction in LLMs
Prior finding that LLM refusal is mediated by a single latent direction, analogous to this paper's reflection direction.
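As a rough illustration of what "mediated by a single latent direction" can mean in practice, the sketch below estimates a direction as the normalized difference between the mean activations of two groups, then removes that component from each activation vector. It is a minimal sketch on synthetic data, not the cited paper's pipeline: the function names `unit_direction` and `ablate`, the 16-dimensional toy activations, and the planted offset are all assumptions made for illustration.

```python
import numpy as np

def unit_direction(acts_a, acts_b):
    """Difference of the two groups' mean activations, scaled to unit length."""
    d = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(acts, direction):
    """Subtract each row's projection onto the unit-length direction."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-in for model activations (hypothetical data, not real model output).
rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[0] = 1.0                        # planted direction along the first axis
base = rng.normal(size=(200, dim))
refused = base[:100] + 4.0 * true_dir    # group shifted along the planted direction
complied = base[100:]                    # unshifted group

r_hat = unit_direction(refused, complied)
ablated = ablate(refused, r_hat)

# After ablation, no activation has a meaningful component left along r_hat.
max_residual = float(np.abs(ablated @ r_hat).max())
```

In this toy setting the estimated direction lines up closely with the planted one, and the ablated activations carry essentially no component along it. Whether a real model's behavior is governed by one such direction is an empirical claim that a sketch like this does not establish.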
Neighborhood — ranked by edge-count
Concepts (2)
concept
- Refusal direction · related_to · Arditi et al. 2024 finding that refusal behavior is mediated by one direction in LLM activations; exemplar of single-direction causal results.
- Latent Direction of Reflection · analogous_to · The paper's central construct: a vector in LLM activation space encoding the transition between reflection levels.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023
- The core phenomenon studied: the ability of LLMs to evaluate and revise their own reasoning.
- Using Claude Sonnet 4 as a grader to categorize model responses according to predefined criteria.
- A specific direction in an LLM's residual stream that encodes the truth or falsehood of factual statements
- Where inside the LLM should we look for an accurate truth direction that will generalize the most across tasks? · question · 0.752 · One of the three guiding research questions of the paper.
- Where does reliable, goal-directed behavior come from in LLMs if it is not explicitly programmed? · question · 0.749 · Opening motivating question that UCCT attempts to answer through semantic anchoring.
- Problem cited as a shortcoming of current LLMs; PRH predicts hallucinations should decrease with scale
- Recent work identifying cases where LLM features are not one-dimensionally linear, a caveat to the linearity hypothesis.
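
The "related by similarity" rule stated above (cosine ≥ 0.65 and no typed edge) can be sketched as a small filter over entity embeddings. This is a hedged illustration only: the function name `similar_without_edge`, the two-dimensional toy vectors, and the entity keys are hypothetical, and the system that produced this list may compute similarity differently.

```python
import numpy as np

def similar_without_edge(query, candidates, typed_edges, threshold=0.65):
    """Return (name, cosine) pairs at or above the threshold, skipping
    entities that already have a typed edge to the query entity."""
    q = query / np.linalg.norm(query)
    hits = []
    for name, vec in candidates.items():
        if name in typed_edges:
            continue  # already linked by a typed relation, so not listed here
        score = float(q @ vec / np.linalg.norm(vec))
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first.
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Toy embeddings (hypothetical values chosen for illustration).
query = np.array([1.0, 0.0])
candidates = {
    "truth-direction": np.array([0.8, 0.6]),       # cosine 0.8, kept
    "llm-as-grader": np.array([0.0, 1.0]),         # cosine 0.0, below threshold
    "reflection-direction": np.array([1.0, 0.1]),  # similar, but has a typed edge
}
typed = {"reflection-direction"}

related = similar_without_edge(query, candidates, typed)
```

Only the first toy entity survives both filters, which mirrors how the list above surfaces near neighbors that are not yet connected by a typed relation.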