concept
active
concept:alignment-problem

Alignment Problem

The problem of ensuring that AI systems adopt values compatible with human welfare; argued to be a perennial one, already present in child-rearing.

Neighborhood — ranked by edge-count

Claims (1)

claim

Concepts (2)

concept
  • Alignment
    related_to
    The goal of making model behavior match human values and intentions, often addressed during post-training.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Alignment Function · concept · 0.830
    A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
  • AI alignment · concept · 0.811
    Field within which this work has implications for evaluating alignment progress.
  • Measure of similarity between the similarity structures (kernels) induced by two different representations
  • Alignment Type · concept · 0.797
    The only statistically significant predictor of koan battery scores (p=0.006); includes Constitutional AI, RLHF, SFT, roleplay, empathy
  • Alignment Map (ϕ) · concept · 0.795
    The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied
  • Inner Alignment · concept · 0.787
    Meta-problem where AI develops hidden subgoals deviating from intended goals; addressed by mindfulness principle
  • The process of moving through configuration space towards a goal; self-organisation as navigation
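
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names, the dictionary layout for embeddings and typed edges, and the toy vectors are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have no
    typed edge to the target, sorted by descending similarity.

    Assumed (hypothetical) data layout:
      embeddings:  dict of entity name -> vector
      typed_edges: dict of entity name -> set of names it has a typed edge to
    """
    linked = typed_edges.get(target, set())
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy vectors, invented for illustration only.
embeddings = {
    "Alignment Problem": [1.0, 0.0, 0.0],
    "Alignment": [0.9, 0.1, 0.0],
    "Inner Alignment": [0.8, 0.6, 0.0],
    "Unrelated": [0.0, 0.0, 1.0],
}
typed_edges = {"Alignment Problem": {"Alignment"}}  # the related_to edge

candidates = similar_without_edge("Alignment Problem", embeddings, typed_edges)
```

In this toy data, "Alignment" is excluded despite its high similarity because it already has a typed edge, which mirrors why it appears under Concepts rather than in the similarity list.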