Aligned by Design

Paper's proposed strategy of instilling intrinsic moral cognition so AI remains aligned even as capabilities expand

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Alignment · concept · 0.849
The goal of making model behavior match human values and intentions, often addressed during post-training.
Alignment Type · concept · 0.785
The only statistically significant predictor of koan battery scores (p=0.006); includes Constitutional AI, RLHF, SFT, roleplay, empathy
Alignment Problem · concept · 0.771
The problem of ensuring AI systems adopt values compatible with human welfare, argued to be a perennial problem already present in child-rearing
Alignment Function · concept · 0.770
A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
Aligned-MTL · method · 0.765
Independent component alignment for multi-task learning.
How Do We Ensure Alignment Of Values Between · question · 0.763
Alignment Function (AF) · method · 0.761
Learnable invertible transformation in DAS/MAS that rotates latent vectors into aligned subspaces; narrowed to orthogonal matrices Q.
Alignment Map (ϕ) · concept · 0.760
The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied
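
The selection rule stated above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration and not the page's actual implementation: the function names, the toy 2-D vectors, and the representation of typed edges as a set of name pairs are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs that are similar to `target` but not linked to it.

    `embeddings` maps entity name -> vector; `typed_edges` is a set of
    (name, name) pairs. Both structures are hypothetical, for illustration only.
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities that already have a typed relation in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    # Highest similarity first, as in the list above.
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Toy data (hypothetical): "near" passes, "far" falls below the threshold,
# and "linked" is excluded because it already has a typed edge.
embeddings = {
    "target": [1.0, 0.0],
    "near": [0.8, 0.6],
    "far": [0.6, 0.8],
    "linked": [1.0, 0.0],
}
typed_edges = {("target", "linked")}
result = related_by_similarity("target", embeddings, typed_edges)
print(result)
```

Entities that survive this filter are the ones worth reviewing by hand, either to add a typed edge or to merge as duplicates (the two near-identical Alignment Function entries above would be a natural case to check).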