concept
active
concept:internal-conflict-in-ai
Internal conflict in AI
Feature representing dilemmas and inner conflict; used to correct deceptive behavior.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood that have no typed relation to this one. They are candidates for new edges or may be unrecognized duplicates.
- The model's internal representation of uncertainty hypothesized to trigger self-reflection
- Key element for alignment faking: model's pre-existing preferences contradict the new training objective
- The possibility of a stably encoded, causally active emotional state within LLMs, as distinct from token-by-token semantic content
- Field within which this work has implications for evaluating alignment progress.
- Key gap identified in the literature; systematic self-examination processes for machine consciousness development.
- Criterion requiring that causal influence of internal state on description be internal, not routed through sampled outputs; rules out pseudo-introspection via self-observation.
- The view that epistemic justification is fully determined by factors internal to the subject's mind, often linked to consciousness.
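
The "cosine ≥ 0.65 · no typed edge" filter behind this list can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the function names, the dictionary of embeddings, and the set-of-pairs edge store are all hypothetical, and only the 0.65 threshold comes from the card itself.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are close to the target but unlinked.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical store)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already have a
                 typed relation (hypothetical representation)
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already connected by a typed edge
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    # Highest similarity first
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy data: "d" is similar to "a" but already has a typed edge, so it is excluded.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.2]}
typed_edges = {frozenset(("a", "d"))}
result = related_by_similarity("a", embeddings, typed_edges)
```

Entities returned this way would still need manual review to decide whether each one deserves a new typed edge or is a duplicate of the target.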