concept
active
concept:reflection-in-llms
Reflection in LLMs
The core phenomenon studied: the ability of LLMs to evaluate and revise their own reasoning.
Neighborhood — ranked by edge-count
Claims (1)
claim
- Connection between reflection inhibition and jailbreak attack mechanisms.
Methods (1)
method
- Chain-of-thought prompting · associated_with · Technique by which LLMs generate intermediate reasoning steps before the final output; used by ChatGPT o3.
Concepts (2)
concept
- LLM Meta-Cognition · associated_with · The ability of LLMs to monitor and evaluate their own reasoning, closely related to reflection.
- Situational Reflection · associated_with · The specific form of reflection studied, where a model reflects on reasoning generated by another source.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The underlying mechanism of self-reflection in reasoning LLMs is not yet well understood · question · 0.823 · Broad gap motivating the entire paper
- Open question motivating the entire paper; identified as not yet well understood
- Core hypothesis linking internal uncertainty to self-reflection behavior, tested via probing experiments
- Core claim of ReflCtrl that a single direction captures and controls reflection
- Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023
- Related capability where LLMs correct their own outputs, studied via linear representations.
- The finding that interpretable concepts including character traits are encoded as linear directions in transformer residual streams
- High-dimensional vectors produced at each transformer layer for each input token; the primary substrate analyzed in this study.
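
The selection rule behind the "Related by similarity" list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. The function names, the toy embeddings, and the edge representation below are all hypothetical and chosen for illustration only; the actual system's embedding model and storage are not described on this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_candidates(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to `target` but not linked to it.

    embeddings  -- dict mapping entity id to its embedding vector (assumed layout)
    typed_edges -- iterable of (id_a, id_b) pairs that already have a typed relation
    threshold   -- minimum cosine similarity; 0.65 mirrors the cutoff shown above
    """
    # Entities that already share a typed edge with the target are excluded.
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target or entity_id in linked:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            results.append((entity_id, round(score, 3)))

    # Highest similarity first, as in the ranked list.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data: 2-dimensional vectors standing in for real embeddings.
toy_embeddings = {
    "reflection-in-llms": [1.0, 0.0],
    "self-correction": [1.0, 0.1],      # similar, no typed edge -> candidate
    "unrelated-topic": [0.0, 1.0],      # orthogonal -> below threshold
    "llm-meta-cognition": [1.0, 0.0],   # similar, but already linked
}
toy_edges = [("reflection-in-llms", "llm-meta-cognition")]

candidates = similarity_candidates("reflection-in-llms", toy_embeddings, toy_edges)
```

In this toy setup only "self-correction" is returned: "llm-meta-cognition" is dropped because it already has a typed edge, and "unrelated-topic" falls below the threshold.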