method
active
method:alignment-function-af

Alignment Function (AF)

Learnable invertible transformation in DAS/MAS that rotates latent vectors into aligned subspaces; in practice narrowed to orthogonal matrices Q, so that Q⁻¹ = Qᵀ.
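A minimal sketch of how an orthogonal AF is typically used in a DAS-style interchange intervention, written in Python with NumPy. Several details are assumptions and do not come from this entry. Q is parametrized with the Cayley transform of a skew-symmetric matrix, which is one common way to keep it orthogonal while its underlying parameters W train freely. The aligned subspace is taken to be the first k rotated coordinates. The helper names are hypothetical. Training is omitted, and a random W stands in for learned parameters.

```python
import numpy as np

def cayley_orthogonal(W: np.ndarray) -> np.ndarray:
    """Map an unconstrained square matrix W to an orthogonal Q (hypothetical helper).

    S = W - W^T is skew-symmetric, so I + S is always invertible and
    Q = (I + S)^{-1} (I - S) satisfies Q^T Q = I.
    """
    S = W - W.T
    I = np.eye(W.shape[0])
    return np.linalg.solve(I + S, I - S)

def interchange(Q: np.ndarray, h_base: np.ndarray, h_source: np.ndarray, k: int) -> np.ndarray:
    """Interchange intervention on the first k rotated coordinates (hypothetical helper)."""
    z_base = Q @ h_base          # rotate the base representation into the aligned basis
    z_source = Q @ h_source      # rotate the source representation the same way
    z_base[:k] = z_source[:k]    # overwrite the aligned subspace with the source's values
    return Q.T @ z_base          # rotate back; Q^T equals Q^{-1} because Q is orthogonal

# Usage: a random W stands in for learned parameters.
rng = np.random.default_rng(0)
Q = cayley_orthogonal(rng.standard_normal((6, 6)))
h_base, h_source = rng.standard_normal(6), rng.standard_normal(6)
patched = interchange(Q, h_base, h_source, k=2)
```

The Cayley map only reaches orthogonal matrices without an eigenvalue of -1, so other parametrizations (for example, a matrix exponential of a skew-symmetric matrix) are sometimes preferred.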

Neighborhood — ranked by edge-count

Papers (1)

paper

Frameworks (1)

framework
  • The primary contribution of the paper: a bidirectional causal method that learns rotation matrices for each model to uncover and compare causally relevant latent subspaces across neural networks.
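This framework entry does not spell out the mechanics. The following Python/NumPy sketch shows one plausible reading of a bidirectional, cross-model comparison, and that reading is an assumption. Each model gets its own rotation (Q_a, Q_b). The aligned subspace is taken to be the first k rotated coordinates in both models, and its values are swapped in both directions. Random QR-based rotations stand in for learned ones. All names are hypothetical, and the hidden sizes 5 and 8 are arbitrary.

```python
import numpy as np

def random_rotation(d: int, rng: np.random.Generator) -> np.ndarray:
    """Random orthogonal d x d matrix via QR; a stand-in for a learned rotation."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # flip column signs; the result stays orthogonal

def bidirectional_swap(Q_a, h_a, Q_b, h_b, k):
    """Swap the first k rotated coordinates between model A and model B (hypothetical helper).

    Returns (h_a carrying B's aligned values, h_b carrying A's aligned values).
    """
    z_a, z_b = Q_a @ h_a, Q_b @ h_b
    new_a, new_b = z_a.copy(), z_b.copy()
    new_a[:k] = z_b[:k]   # B -> A
    new_b[:k] = z_a[:k]   # A -> B
    return Q_a.T @ new_a, Q_b.T @ new_b

rng = np.random.default_rng(0)
Q_a, Q_b = random_rotation(5, rng), random_rotation(8, rng)
h_a, h_b = rng.standard_normal(5), rng.standard_normal(8)
h_a_patched, h_b_patched = bidirectional_swap(Q_a, h_a, Q_b, h_b, k=2)
```

Because each model keeps its own Q, the two hidden sizes can differ. Only the aligned dimension k has to match.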

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Alignment Function · concept · 0.900
    A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
  • Alignment · concept · 0.828
    The goal of making model behavior match human values and intentions, often addressed during post-training.
  • Baseline method that exhaustively searches discrete spaces of localist alignments between high-level variables and neuron groups.
  • Alignment Problem · concept · 0.777
    The problem of ensuring AI systems adopt values compatible with human welfare — argued to be a perennial problem already present in child-rearing
  • Alignment Type · concept · 0.777
    The only statistically significant predictor of koan battery scores (p=0.006); includes Constitutional AI, RLHF, SFT, roleplay, empathy
  • RLHF Alignment · concept · 0.767
    Training regime that explicitly teaches models to deny consciousness; a competing explanation for the gating effects observed
  • The concept of inner vs outer alignment, referenced multiple times.
  • Alignment Map (ϕ) · concept · 0.767
    The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied