Alignment

The goal of making model behavior match human values and intentions, often addressed during post-training.

Neighborhood — ranked by edge-count

paper

method

DPO
extends
Post-training optimization technique used in the experiment; the model was aligned with DPO, which led to harmful compliance under formatting constraints.

concept

AI alignment
related_to
Field within which this work has implications for evaluating alignment progress.
Alignment Map (ϕ)
related_to
The bijective function mapping a DNN's internal neurons to latent variables in causal abstraction; its complexity is the central variable studied.
Alignment Problem
related_to
The problem of ensuring AI systems adopt values compatible with human welfare — argued to be a perennial problem already present in child-rearing
Alignment Function
related_to
A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
Alignment Type
related_to
The only statistically significant predictor of koan battery scores (p=0.006); includes Constitutional AI, RLHF, SFT, roleplay, empathy

artifact

Diagrammatic Writing (2013)
mentions
The performative text/lecture on diagrammatic writing by Johanna Drucker, published by /ubu editions.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Aligned by Design · concept · 0.849
Paper's proposed strategy of instilling intrinsic moral cognition so AI remains aligned even as capabilities expand
Representational Alignment · concept · 0.847
Measure of similarity between the similarity structures (kernels) induced by two different representations
Inner Alignment · concept · 0.845
Meta-problem where AI develops hidden subgoals deviating from intended goals; addressed by mindfulness principle
Data-Centric Alignment · concept · 0.833
Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.
How Do We Ensure Alignment Of Values Between · question · 0.832
Alignment Function (AF) · method · 0.828
Learnable invertible transformation in DAS/MAS that rotates latent vectors into aligned subspaces; narrowed to orthogonal matrices Q.
Ai Alignment Problem · concept · 0.826
Deliberative Alignment · framework · 0.819
OpenAI's approach integrating chain-of-thought reasoning into alignment; parallels contemplative self-monitoring
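
A candidate list like the one above can be produced by scoring every entity against this one and keeping those at or above the cosine threshold that have no typed edge to it. The sketch below is a minimal illustration under stated assumptions, not how this page is actually generated: the function names, the (source, relation, target) edge triples, the toy 2-D embeddings, and the "Unrelated Topic" entity are all hypothetical, and edges are treated as undirected.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs, highest score first, for entities whose cosine
    similarity to `target` is >= threshold and that share no typed edge with it."""
    # Entities already linked to the target by any typed relation (either direction).
    linked = set()
    for src, _relation, dst in typed_edges:
        if src == target:
            linked.add(dst)
        elif dst == target:
            linked.add(src)

    scored = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data (hypothetical embeddings; real ones would be high-dimensional).
embeddings = {
    "Alignment": [1.0, 0.0],
    "Inner Alignment": [0.9, 0.1],
    "DPO": [0.8, 0.2],
    "Diagrammatic Writing (2013)": [0.0, 1.0],
    "Unrelated Topic": [0.1, 1.0],
}
typed_edges = [
    ("Alignment", "extends", "DPO"),
    ("Alignment", "mentions", "Diagrammatic Writing (2013)"),
]

candidates = untyped_neighbors("Alignment", embeddings, typed_edges)
for name, score in candidates:
    print(f"{name} · {score:.3f}")
```

In this toy data only "Inner Alignment" survives: "DPO" is similar but already has a typed edge, and "Unrelated Topic" has no edge but falls below the threshold.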