Alignment Map (ϕ)

The bijective function that maps a DNN's inner neurons (hidden states) to latent variables in causal abstraction; its complexity is the central variable studied.

Neighborhood — ranked by edge-count

concept

Alignment
related_to
The goal of making model behavior match human values and intentions, often addressed during post-training.
Causal abstraction
uses
A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs
Grokking
associated_with
Observed when training alignment maps on the indirect object identification (IOI) task: interchange intervention accuracy (IIA) stays low for many steps, then jumps quickly.
Latent Variables in Distributed Abstraction
associated_with
Output of alignment map ϕ applied to DNN hidden states; basis for distributed causal abstraction
Variational Family V for Alignment Maps
extends
Generalised notion that restricts alignment maps to a family V; linearity is a special case.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Identity Alignment Map (ϕ_id) · method · 0.876
Simplest alignment map, ϕ(h) = h; equivalent to assuming the privileged-basis hypothesis.
Linear Alignment Map (ϕ_lin) · method · 0.857
Alignment map ϕ(h) = W_orth·h, where W_orth is an orthogonal matrix; assumes the linear representation hypothesis.
Non-Linear Alignment Map (ϕ_nonlin) · method · 0.834
Alignment map implemented as a reversible residual network (RevNet); assumes a non-linear representation hypothesis.
Alignment Function · concept · 0.811
A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
Alignment Problem · concept · 0.795
The problem of ensuring AI systems adopt values compatible with human welfare; argued to be a perennial problem already present in child-rearing.
AI alignment · concept · 0.785
Field within which this work has implications for evaluating alignment progress.
Alignment Type · concept · 0.779
The only statistically significant predictor of koan battery scores (p=0.006); includes Constitutional AI, RLHF, SFT, roleplay, empathy
Representational Alignment · concept · 0.774
Measure of similarity between the similarity structures (kernels) induced by two different representations
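
The three alignment-map variants listed above (identity, linear, non-linear) differ only in how much structure ϕ is allowed to have, while all remain bijective. The following is a minimal sketch of that idea, not the implementation from the source: every function name is hypothetical, the weights are fixed illustrative values rather than learned parameters, the orthogonal matrix is a single plane rotation, and the non-linear map is one additive-coupling block in the style of a RevNet (it assumes an even hidden dimension).

```python
import math

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def plane_rotation(d, i, j, theta):
    """d x d rotation in the (i, j) plane; orthogonal by construction.
    Stand-in for a learned W_orth (assumption, not from the source)."""
    w = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    w[i][i] = w[j][j] = math.cos(theta)
    w[i][j] = -math.sin(theta)
    w[j][i] = math.sin(theta)
    return w

def phi_identity(h):
    """Identity map: phi(h) = h."""
    return list(h)

def phi_linear(h, w_orth):
    """Linear map: phi(h) = W_orth h."""
    return matvec(w_orth, h)

def phi_linear_inv(z, w_orth):
    """Inverse of the linear map: W_orth^T z, since W_orth is orthogonal."""
    return matvec(transpose(w_orth), z)

def _residual(x, m):
    """Small non-linear residual function: tanh(m x)."""
    return [math.tanh(t) for t in matvec(m, x)]

def phi_nonlinear(h, m_f, m_g):
    """One reversible block: z1 = h1 + F(h2), z2 = h2 + G(z1)."""
    k = len(h) // 2
    h1, h2 = h[:k], h[k:]
    z1 = [a + b for a, b in zip(h1, _residual(h2, m_f))]
    z2 = [a + b for a, b in zip(h2, _residual(z1, m_g))]
    return z1 + z2

def phi_nonlinear_inv(z, m_f, m_g):
    """Exact inverse of the block: h2 = z2 - G(z1), h1 = z1 - F(h2)."""
    k = len(z) // 2
    z1, z2 = z[:k], z[k:]
    h2 = [a - b for a, b in zip(z2, _residual(z1, m_g))]
    h1 = [a - b for a, b in zip(z1, _residual(h2, m_f))]
    return h1 + h2

# Illustrative hidden state and weights (made-up values).
h = [0.5, -1.0, 2.0, 0.25]
w_orth = plane_rotation(4, 0, 2, 0.7)
m_f = [[0.3, -0.2], [0.1, 0.4]]
m_g = [[-0.5, 0.2], [0.6, 0.1]]

z_lin = phi_linear(h, w_orth)
z_non = phi_nonlinear(h, m_f, m_g)
h_from_lin = phi_linear_inv(z_lin, w_orth)
h_from_non = phi_nonlinear_inv(z_non, m_f, m_g)
```

Both round trips recover `h` up to floating-point error, which is the bijectivity property the entry's definition requires. The coupling structure makes the non-linear map invertible without ever inverting the residual functions themselves.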