Alignment Function (AF)

Learnable invertible transformation used in DAS/MAS to rotate latent vectors into aligned subspaces; in practice restricted to orthogonal matrices Q.
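
A minimal sketch of why an orthogonal Q is a convenient choice: its inverse is its transpose, so a latent vector can be rotated into the aligned basis, edited there, and rotated back without loss. Everything here is an illustrative assumption rather than the DAS/MAS implementation: the 2-D case, the single angle standing in for learned parameters, and the helper names rotation, matvec, and interchange.

```python
import math

def rotation(theta):
    """2x2 orthogonal matrix Q from one angle (hypothetical stand-in for a learned Q)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def transpose(m):
    return [list(row) for row in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def interchange(q, h_base, h_source, dims):
    """Copy the aligned coordinates in `dims` from source into base, then rotate back."""
    z_base = matvec(q, h_base)      # rotate both latents into the aligned basis
    z_source = matvec(q, h_source)
    for i in dims:
        z_base[i] = z_source[i]     # overwrite only the chosen aligned coordinates
    return matvec(transpose(q), z_base)  # Q^T inverts an orthogonal Q

Q = rotation(0.3)
h = [1.0, 2.0]
roundtrip = matvec(transpose(Q), matvec(Q, h))  # recovers h up to float error
patched = interchange(Q, [1.0, 2.0], [-3.0, 0.5], dims=[0])
```

In this sketch, patched agrees with the source latent along aligned coordinate 0 and with the base latent along coordinate 1, which is the kind of subspace-level swap that a rotation into aligned subspaces makes possible.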

Neighborhood — ranked by edge-count

paper

framework

Model Alignment Search (MAS)
uses
The primary contribution of the paper: a bidirectional causal method that learns rotation matrices for each model to uncover and compare causally relevant latent subspaces across neural networks.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
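
The selection rule described here can be sketched as a simple filter. Only the 0.65 threshold comes from this page; the embedding vectors, the entity keys, the typed-edge set, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the threshold that have no typed edge to `target`."""
    results = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and already-linked entities
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data: made-up 2-D embeddings, not values from the graph.
embeddings = {
    "AF": [1.0, 0.0],
    "near": [0.9, 0.1],
    "far": [0.0, 1.0],
    "MAS": [1.0, 0.1],
}
candidates = untyped_neighbors("AF", embeddings, typed_edges={"MAS"})
```

With this toy data, only "near" is returned: "far" falls below the threshold and "MAS" is excluded because it already has a typed edge.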

Alignment Function · concept · 0.900
A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
Alignment · concept · 0.828
The goal of making model behavior match human values and intentions, often addressed during post-training.
Brute-Force Alignment Search · method · 0.802
Baseline method that exhaustively searches discrete spaces of localist alignments between high-level variables and neuron groups.
Alignment Problem · concept · 0.777
The problem of ensuring AI systems adopt values compatible with human welfare — argued to be a perennial problem already present in child-rearing
Alignment Type · concept · 0.777
The only statistically significant predictor of koan battery scores (p=0.006); includes Constitutional AI, RLHF, SFT, roleplay, empathy
RLHF Alignment · concept · 0.767
Training regime that explicitly teaches models to deny consciousness; a competing explanation for the gating effects observed
Inner alignment framework · framework · 0.767
The concept of inner vs outer alignment, referenced multiple times.
Alignment Map (ϕ) · concept · 0.767
The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied