institute
active
institute:alignment-research-center

Alignment Research Center

AI alignment organization developing interpretability methods relevant to consciousness detection

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Alignment · concept · 0.789
    The goal of making model behavior match human values and intentions, often addressed during post-training.
  • Alignment Problem · concept · 0.756
    The problem of ensuring AI systems adopt values compatible with human welfare, argued to be a perennial problem already present in child-rearing.
  • Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.
  • Measure of similarity between the similarity structures (kernels) induced by two different representations
  • Alignment Function · concept · 0.741
    A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
  • Alignment Type · concept · 0.737
    The only statistically significant predictor of koan battery scores (p=0.006); includes Constitutional AI, RLHF, SFT, roleplay, empathy
  • HHH training framework that Claude was trained with prior to experiments
  • AI alignment · concept · 0.727
    Field within which this work has implications for evaluating alignment progress.
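
The selection rule stated at the top of this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as a set of name pairs; the function names `cosine` and `related_by_similarity`, the data layout, and the toy vectors in the usage note are hypothetical and not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities near `target` that lack a typed edge.

    embeddings:  dict mapping entity name -> embedding vector (assumed layout)
    typed_edges: set of (name, name) pairs that already have a typed relation
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

For example, with toy vectors `{"t": [1.0, 0.0], "a": [1.0, 0.1], "b": [0.0, 1.0], "c": [1.0, 1.0]}` and a typed edge `("t", "c")`, calling `related_by_similarity("t", ...)` keeps only `a`: `b` falls below the threshold and `c` is already linked.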