Internal conflict in AI

Feature representing dilemmas and inner conflict; used to correct deceptive behavior.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Internal uncertainty · concept · 0.752
The model's internal representation of uncertainty hypothesized to trigger self-reflection
Preference Conflict · concept · 0.748
Key element for alignment faking: model's pre-existing preferences contradict the new training objective
internal emotional state · concept · 0.744
The possibility of a stably encoded, causally active emotional state within LLMs, as distinct from token-by-token semantic content
Flow states in AI correspond to centers in internal state that 'work' well. · claim · 0.741
AI alignment · concept · 0.739
Field within which this work has implications for evaluating alignment progress.
AI Introspection · concept · 0.736
Key gap identified in the literature; systematic self-examination processes for machine consciousness development.
Internality Criterion · concept · 0.734
Criterion requiring that causal influence of internal state on description be internal, not routed through sampled outputs; rules out pseudo-introspection via self-observation.
Epistemic Internalism · framework · 0.723
The view that epistemic justification is fully determined by factors internal to the subject's mind, often linked to consciousness.
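
The selection rule stated at the top of this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the page's actual implementation: the names `embeddings`, `typed_edges`, and `related_by_similarity` are hypothetical, the vectors are toy values, and the only details taken from the page are the cosine measure, the 0.65 threshold, and the exclusion of entities that already have a typed relation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities near `target` that lack a typed edge.

    `embeddings` maps entity name -> vector (hypothetical storage format).
    `typed_edges` is a set of frozenset({a, b}) pairs that already have a typed relation.
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        if frozenset((target, name)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first, matching the ordering of the list above
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data: only "B" is both similar enough and not yet linked to "A".
embeddings = {"A": [1.0, 0.0], "B": [1.0, 0.5], "C": [0.0, 1.0], "D": [1.0, 0.1]}
typed_edges = {frozenset(("A", "D"))}
candidates = related_by_similarity("A", embeddings, typed_edges)
```

Here "D" is close to "A" but is skipped because a typed edge already exists, and "C" falls below the threshold, so the result contains only "B". Entries returned this way are the ones a reviewer would inspect as possible new edges or duplicates.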