framework
active
framework:inner-alignment-framework

Inner alignment framework

The concept of inner vs outer alignment, referenced multiple times.

Neighborhood — ranked by edge-count

Hypotheses (1)

hypothesis

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Inner Alignment · concept · 0.856
    Meta-problem where AI develops hidden subgoals deviating from intended goals; addressed by mindfulness principle
  • Alignment Function · concept · 0.779
    A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
  • Alignment · concept · 0.773
    The goal of making model behavior match human values and intentions, often addressed during post-training.
  • Learnable invertible transformation in DAS/MAS that rotates latent vectors into aligned subspaces; narrowed to orthogonal matrices Q.
  • RLHF Alignment · concept · 0.764
    Training regime that explicitly teaches models to deny consciousness; a competing explanation for the gating effects observed
  • Measure of similarity between the similarity structures (kernels) induced by two different representations
  • Framework · concept · 0.744
    1984 Ashton-Tate integrated system with frames, FRED language, and overlapping windows; design reference for Playground's approach.
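
As a rough illustration of the selection rule behind this list (cosine ≥ 0.65, no typed edge), the Python sketch below shows one way such candidates could be computed. The function names, the toy embeddings, and the representation of typed edges as a set of name pairs are all assumptions made for illustration; none of them are taken from the actual system.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities similar to `target` that have
    no typed edge to it, sorted by descending score.

    embeddings:  dict mapping entity name -> vector (hypothetical layout)
    typed_edges: set of (name, name) pairs, treated as undirected (assumption)
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities that already have a typed relation to the target.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Toy data, invented for this example only.
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [1.0, 0.1],   # close to "a", no typed edge: should be listed
    "c": [0.0, 1.0],   # orthogonal to "a": below threshold
    "d": [1.0, 0.2],   # close to "a" but already linked: excluded
}
toy_edges = {("a", "d")}
candidates = similar_without_edge("a", toy_embeddings, toy_edges)
```

Entities returned this way are only candidates: a high score suggests a possible missing edge or a duplicate, but each pair still needs review, as the mixed senses of "alignment" and "framework" in the list above show.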