Alignment Research Center

AI alignment organization developing interpretability methods relevant to consciousness detection

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one. These are candidates for new edges or unrecognized duplicates.
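The filter described above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names, the dictionary of embeddings, and the representation of typed edges as unordered pairs are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities near `target` that lack a typed edge.

    embeddings  -- hypothetical mapping of entity name to embedding vector
    typed_edges -- hypothetical set of frozenset({a, b}) pairs that already
                   have a typed relation
    threshold   -- minimum cosine similarity (0.65, as stated on this page)
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities that already have a typed relation to the target.
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would then be reviewed by hand, either to add a typed edge or to merge duplicates.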

Alignment · concept · 0.789
The goal of making model behavior match human values and intentions, often addressed during post-training.
Alignment Problem · concept · 0.756
The problem of ensuring AI systems adopt values compatible with human welfare, argued to be a perennial problem already present in child-rearing.
Data-Centric Alignment · concept · 0.754
Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.
Representational Alignment · concept · 0.750
A measure of how similar the similarity structures (kernels) induced by two different representations are.
Alignment Function · concept · 0.741
A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables.
Alignment Type · concept · 0.737
The only statistically significant predictor of koan battery scores (p = 0.006); categories include Constitutional AI, RLHF, SFT, roleplay, and empathy.
A General Language Assistant as a Laboratory for Alignment (Askell et al. 2021) · concept · 0.732
HHH (helpful, honest, harmless) training framework that Claude was trained with prior to the experiments.
AI alignment · concept · 0.727
Field within which this work has implications for evaluating alignment progress.