institute
active
institute:alignment-research-center
Alignment Research Center
AI alignment organization developing interpretability methods relevant to consciousness detection
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The goal of making model behavior match human values and intentions, often addressed during post-training.
- The problem of ensuring AI systems adopt values compatible with human welfare, argued to be a perennial one already present in child-rearing.
- Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.
- Measure of similarity between the similarity structures (kernels) induced by two different representations
- A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
- The only statistically significant predictor of koan battery scores (p=0.006); includes Constitutional AI, RLHF, SFT, roleplay, empathy
- HHH training framework that Claude was trained with prior to experiments
- Field within which this work has implications for evaluating alignment progress.