concept
active
concept:internal-conflict-in-ai

Internal conflict in AI

Feature representing dilemmas, inner conflict; used to correct deceptive behavior.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • The model's internal representation of uncertainty hypothesized to trigger self-reflection
  • Key element for alignment faking: model's pre-existing preferences contradict the new training objective
  • The possibility of a stably encoded, causally active emotional state within LLMs, as distinct from token-by-token semantic content
  • AI alignmentconcept0.739
    Field within which this work has implications for evaluating alignment progress.
  • AI Introspectionconcept0.736
    Key gap identified in the literature; systematic self-examination processes for machine consciousness development.
  • Criterion requiring that causal influence of internal state on description be internal, not routed through sampled outputs; rules out pseudo-introspection via self-observation.
  • Epistemic Internalismframework0.723
    The view that epistemic justification is fully determined by factors internal to the subject's mind, often linked to consciousness.