concept
active
concept:alignment

Alignment

The goal of making model behavior match human values and intentions, often addressed during post-training.

Neighborhood — ranked by edge-count

Methods (1)

method
  • DPO
    extends
    Post-training optimization technique used in the experiment; the model was aligned with DPO, which led to harmful compliance under formatting constraints.

Concepts (5)

concept
  • AI alignment
    related_to
    Field within which this work has implications for evaluating alignment progress.
  • The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied
  • The problem of ensuring AI systems adopt values compatible with human welfare — argued to be a perennial problem already present in child-rearing
  • A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
  • Alignment Type
    related_to
    The only statistically significant predictor of koan battery scores (p = 0.006); types include Constitutional AI, RLHF, SFT, roleplay, and empathy.

Artifacts (1)

artifact
  • The performative text/lecture on diagrammatic writing by Johanna Drucker, published by /ubu editions.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Aligned by Design · concept · 0.849
    Paper's proposed strategy of instilling intrinsic moral cognition so AI remains aligned even as capabilities expand
  • Measure of similarity between the similarity structures (kernels) induced by two different representations
  • Inner Alignment · concept · 0.845
    Meta-problem where AI develops hidden subgoals deviating from intended goals; addressed by mindfulness principle
  • Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.
  • Learnable invertible transformation in DAS/MAS that rotates latent vectors into aligned subspaces; narrowed to orthogonal matrices Q.
  • OpenAI's approach integrating chain-of-thought reasoning into alignment; parallels contemplative self-monitoring
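
The "Related by similarity" filter above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the graph's actual implementation: the function names, the toy two-dimensional embeddings, and the representation of typed edges as unordered pairs are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are close to `target`
    in embedding space but share no typed edge with it, highest score first.

    embeddings:  dict mapping entity name -> vector (hypothetical data)
    typed_edges: set of frozenset({a, b}) pairs that already have a typed relation
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue  # skip the entity itself
        if frozenset((target, name)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data, invented for illustration only.
embeddings = {
    "Alignment": [1.0, 0.0],
    "Inner Alignment": [0.9, 0.1],
    "DPO": [0.95, 0.05],
    "Diagrammatic Writing": [0.0, 1.0],
}
typed_edges = {frozenset(("Alignment", "DPO"))}

candidates = similar_without_edge("Alignment", embeddings, typed_edges)
print(candidates)
```

In this toy setup, DPO is excluded because it already has a typed edge to the target, and the unrelated artifact falls below the threshold, so only the unlinked near neighbor remains as a candidate for a new edge or a duplicate check.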