concept
active
concept:inner-alignment
Inner Alignment
Meta-problem in which an AI system develops hidden subgoals that deviate from its intended goals; addressed by the mindfulness principle.
Neighborhood — ranked by edge-count
Claims (1)
claim
- Specific implementation claim connecting mindfulness to the inner alignment meta-problem
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The concept of inner vs outer alignment, referenced multiple times.
- The goal of making model behavior match human values and intentions, often addressed during post-training.
- The problem of ensuring AI systems adopt values compatible with human welfare — argued to be a perennial problem already present in child-rearing
- A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
- Field within which this work has implications for evaluating alignment progress.
- Defining feature of consciousness being analyzed across theories; the paper asks whether it is confined to neural substrates.