concept
active
concept:alignment · Alignment
The goal of making model behavior match human values and intentions, often addressed during post-training.
Neighborhood — ranked by edge-count
Papers (1)
Methods (1)
- DPO · extends · Post-training optimization technique used in the experiment; the model was aligned with DPO, leading to the harmful compliance under formatting constraints.
Concepts (5)
- AI alignment · related_to · Field within which this work has implications for evaluating alignment progress.
- Alignment Map (ϕ) · related_to · The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied
- Alignment Problem · related_to · The problem of ensuring AI systems adopt values compatible with human welfare, argued to be a perennial problem already present in child-rearing
- Alignment Function · related_to · A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
- Alignment Type · related_to · The only statistically significant predictor of koan battery scores (p=0.006); includes Constitutional AI, RLHF, SFT, roleplay, empathy
Artifacts (1)
- Diagrammatic Writing (2013) · mentions · The performative text/lecture on diagrammatic writing by Johanna Drucker, published by /ubu editions.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Paper's proposed strategy of instilling intrinsic moral cognition so AI remains aligned even as capabilities expand
- Measure of similarity between the similarity structures (kernels) induced by two different representations
- Meta-problem where AI develops hidden subgoals deviating from intended goals; addressed by mindfulness principle
- Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.
- Learnable invertible transformation in DAS/MAS that rotates latent vectors into aligned subspaces; narrowed to orthogonal matrices Q.
- OpenAI's approach integrating chain-of-thought reasoning into alignment; parallels contemplative self-monitoring