Deliberative Alignment

OpenAI's approach integrating chain-of-thought reasoning into alignment; parallels contemplative self-monitoring

Neighborhood — ranked by edge-count

Contemplative Reinforcement Learning · framework · extends
Paper's proposed RL approach rewarding contemplative qualities in chain-of-thought reasoning

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Alignment · concept · 0.819
The goal of making model behavior match human values and intentions, often addressed during post-training.

Representational Alignment · concept · 0.803
Measure of similarity between the similarity structures (kernels) induced by two different representations

Inner Alignment · concept · 0.770
Meta-problem where AI develops hidden subgoals deviating from intended goals; addressed by mindfulness principle

Alignment Function · concept · 0.769
A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables

Data-Centric Alignment · concept · 0.759
Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.

Distributed Alignment Search · method · 0.759
The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.

Alignment Problem · concept · 0.755
The problem of ensuring AI systems adopt values compatible with human welfare, argued to be a perennial problem already present in child-rearing

AI alignment · concept · 0.750
Field within which this work has implications for evaluating alignment progress.
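
The selection rule behind this list (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the function names, the toy two-dimensional embeddings, and the representation of typed edges as a set of name pairs are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs that are similar to `target` but share
    no typed edge with it, ranked by descending cosine similarity."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities that already have a typed relation in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: real embeddings would be high-dimensional.
embeddings = {
    "Deliberative Alignment": [1.0, 0.0],
    "Alignment": [0.8, 0.6],
    "Contemplative Reinforcement Learning": [0.9, 0.1],
    "Unrelated Entity": [0.0, 1.0],
}
# Assumed edge store: one typed ("extends") relation already exists.
typed_edges = {("Contemplative Reinforcement Learning", "Deliberative Alignment")}

candidates = untyped_neighbors("Deliberative Alignment", embeddings, typed_edges)
for name, score in candidates:
    print(f"{name} · {score:.3f}")
```

In this toy setup, Contemplative Reinforcement Learning is excluded despite its high similarity because it already has a typed edge, which is the distinction the list above is drawing: the remaining entries are candidates for new edges or for duplicate review.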