Variational Family V for Alignment Maps

Generalised notion that restricts alignment maps to a family V; linearity is a special case

Neighborhood — ranked by edge-count

concept

Alignment Map (ϕ)
extends
The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Identity Alignment Map (ϕ_id) · method · 0.759
Simplest alignment map, ϕ(h)=h; equivalent to assuming the privileged bases hypothesis
Non-Linear Alignment Map (ϕ_nonlin) · method · 0.751
Alignment map implemented as a reversible residual network (RevNet); assumes the non-linear representation hypothesis
Alignment · concept · 0.747
The goal of making model behavior match human values and intentions, often addressed during post-training.
Non-linear alignment map ϕ_nonlin achieves near-optimal IIA across all layers on the hierarchical equality task, eliminating the layer-dependent degradation seen with linear maps · finding · 0.744
Key empirical result: non-linear maps overcome the failure of linear maps in deeper layers
Representational Alignment · concept · 0.744
Measure of similarity between the similarity structures (kernels) induced by two different representations
Linear Alignment Map (ϕ_lin) · method · 0.742
Alignment map ϕ(h)=W_orth·h, where W_orth is an orthogonal matrix; assumes the linear representation hypothesis
configurational variation · concept · 0.741
The vast variety of shapes and sizes in morphogenetic living forms, impossible under blueprint planning.
Alignment Type · concept · 0.736
The only statistically significant predictor of koan battery scores (p=0.006); includes Constitutional AI, RLHF, SFT, roleplay, empathy
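
Three of the entries above (ϕ_id, ϕ_lin, ϕ_nonlin) describe progressively richer choices of the family V, all of which must stay bijective. The toy sketch below illustrates that idea in two dimensions with the standard library only. Everything concrete in it is an assumption made for illustration and is not taken from the source: the rotation angle THETA, the coupling functions f and g, the helper round_trip_error, and the use of a single additive coupling step as a stand-in for a RevNet block.

```python
import math

THETA = 0.7  # hypothetical rotation angle defining a 2x2 orthogonal matrix


def phi_id(h):
    """Identity alignment map: phi(h) = h."""
    return list(h)


def phi_lin(h):
    """Linear alignment map phi(h) = W_orth h, with W_orth a 2D rotation."""
    c, s = math.cos(THETA), math.sin(THETA)
    return [c * h[0] - s * h[1], s * h[0] + c * h[1]]


def phi_lin_inv(z):
    """Inverse of phi_lin: for an orthogonal matrix the inverse is the transpose."""
    c, s = math.cos(THETA), math.sin(THETA)
    return [c * z[0] + s * z[1], -s * z[0] + c * z[1]]


def f(x):
    """Hypothetical coupling function (need not be invertible itself)."""
    return math.tanh(1.5 * x)


def g(x):
    """Second hypothetical coupling function."""
    return math.tanh(-0.8 * x)


def phi_nonlin(h):
    """Non-linear map built from one additive coupling step (RevNet-style)."""
    y1 = h[0] + f(h[1])
    y2 = h[1] + g(y1)
    return [y1, y2]


def phi_nonlin_inv(z):
    """Exact inverse of phi_nonlin: undo the two additive updates in reverse order."""
    x2 = z[1] - g(z[0])
    x1 = z[0] - f(x2)
    return [x1, x2]


def round_trip_error(forward, inverse, h):
    """Largest coordinate-wise error after mapping h forward and back."""
    back = inverse(forward(h))
    return max(abs(a - b) for a, b in zip(back, h))


h = [0.3, -1.2]  # toy hidden vector
for name, fwd, inv in [
    ("phi_id", phi_id, phi_id),
    ("phi_lin", phi_lin, phi_lin_inv),
    ("phi_nonlin", phi_nonlin, phi_nonlin_inv),
]:
    print(name, fwd(h), round_trip_error(fwd, inv, h))
```

The additive coupling form is used here because it is invertible by construction whatever f and g are, which is one way to keep a non-linear member of V bijective; the map referred to in the entry above may be built differently.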