Alignment Function

A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
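As an illustration of the definition above, here is a minimal sketch of an invertible change of basis followed by an interchange of aligned coordinates. It assumes the transformation is an orthogonal matrix, so its inverse is its transpose. The 2x2 rotation, the fixed angle, and the helper names (rotation, matvec, transpose, interchange) are hypothetical choices for this example and are not taken from any DAS implementation. In DAS the transformation is learned by gradient descent rather than set by hand.

```python
import math


def rotation(theta):
    """Return a 2x2 orthogonal matrix Q (rotation by theta radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Transpose a matrix; for an orthogonal Q this is also its inverse."""
    return [list(col) for col in zip(*m)]


def interchange(base, source, q, dims):
    """Swap aligned coordinates between two representations.

    Both vectors are mapped into the rotated basis with q, the
    coordinates listed in dims are copied from source into base,
    and the result is mapped back with q's transpose.
    """
    rotated_base = matvec(q, base)
    rotated_source = matvec(q, source)
    for d in dims:
        rotated_base[d] = rotated_source[d]
    return matvec(transpose(q), rotated_base)


if __name__ == "__main__":
    q = rotation(math.pi / 2)  # illustrative fixed angle, not a learned value
    print(interchange([1.0, 2.0], [3.0, 4.0], q, dims=[0]))
```

With this angle, rotated coordinate 0 corresponds to the negated second neuron, so the call returns approximately [1.0, 4.0]: the base vector with its second entry taken from the source. Passing an empty dims list returns the base unchanged up to floating-point error, which is a quick check that the transformation is invertible.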

Neighborhood — ranked by edge-count

framework

Distributed Alignment Search (DAS)
uses
Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent

concept

Alignment
related_to
The goal of making model behavior match human values and intentions, often addressed during post-training.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Alignment Function (AF) · method · 0.900
Learnable invertible transformation in DAS/MAS that rotates latent vectors into aligned subspaces; narrowed to orthogonal matrices Q.
Alignment Type · concept · 0.835
The only statistically significant predictor of koan battery scores (p=0.006); includes Constitutional AI, RLHF, SFT, roleplay, empathy
Alignment Problem · concept · 0.830
The problem of ensuring AI systems adopt values compatible with human welfare — argued to be a perennial problem already present in child-rearing
Alignment Map (ϕ) · concept · 0.811
The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied
Representational Alignment · concept · 0.809
Measure of similarity between the similarity structures (kernels) induced by two different representations
AI alignment · concept · 0.808
Field within which this work has implications for evaluating alignment progress.
How Do We Ensure Alignment Of Values Between · question · 0.795
Brute-Force Alignment Search · method · 0.789
Baseline method that exhaustively searches discrete spaces of localist alignments between high-level variables and neuron groups.