Identity Alignment Map (ϕ_id)

Simplest alignment map, ϕ(h) = h; equivalent to assuming the privileged bases hypothesis
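As a minimal sketch of what ϕ(h) = h implies in practice: under the identity map, each neuron activation is read directly as a latent variable, so setting a latent coordinate is the same as overwriting one neuron. The function names `phi_id`, `phi_id_inverse`, and `intervene` are hypothetical and chosen only for illustration; they are not taken from the source.

```python
from typing import List, Sequence


def phi_id(h: Sequence[float]) -> List[float]:
    """Identity alignment map: neuron i is treated as latent variable i."""
    return list(h)


def phi_id_inverse(z: Sequence[float]) -> List[float]:
    """The inverse of the identity map is again the identity."""
    return list(z)


def intervene(h: Sequence[float], index: int, value: float) -> List[float]:
    """Map to latent space, overwrite one coordinate, and map back.

    With the identity map this reduces to overwriting a single neuron.
    (Hypothetical helper for illustration only.)
    """
    z = phi_id(h)
    z[index] = value
    return phi_id_inverse(z)


h = [0.2, -1.3, 0.7]          # toy hidden activations (made up)
patched = intervene(h, 1, 5.0)
print(patched)
```

Because no basis change happens, any structure that is spread across several neurons is invisible to this map, which is why it corresponds to the assumption that individual neurons form a privileged basis.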

Neighborhood — ranked by edge-count

paper

concept

Privileged Bases Hypothesis
implements
Hypothesis that neurons form privileged bases to encode information; consistent with constructive abstraction

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
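The selection rule described above can be sketched as follows, assuming each entity has an embedding vector and that typed relations are available as a set of names. The embeddings, entity names, and the helper `untyped_candidates` are all hypothetical; only the 0.65 cosine threshold comes from the page.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_candidates(
    target: str,
    embeddings: Dict[str, Sequence[float]],
    typed_neighbors: Set[str],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities similar to `target` that have no typed edge to it.

    Returned in descending order of cosine similarity.
    """
    scored = [
        (name, cosine(embeddings[target], vec))
        for name, vec in embeddings.items()
        if name != target and name not in typed_neighbors
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy two-dimensional embeddings, made up for illustration.
embeddings = {
    "identity_map": [1.0, 0.0],
    "alignment_map": [1.0, 0.1],
    "unrelated": [0.0, 1.0],
    "privileged_bases": [1.0, 1.0],
}
result = untyped_candidates("identity_map", embeddings, {"privileged_bases"})
print([name for name, _ in result])
```

Here `privileged_bases` is excluded because it already has a typed edge, and `unrelated` falls below the threshold.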

Alignment Map (ϕ) · concept · 0.876
The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied
Alignment · concept · 0.807
The goal of making model behavior match human values and intentions, often addressed during post-training.
Linear Alignment Map (ϕ_lin) · method · 0.800
Alignment map ϕ(h) = W_orth·h using an orthogonal matrix W_orth; assumes the linear representation hypothesis
Non-Linear Alignment Map (ϕ_nonlin) · method · 0.796
Alignment map implemented as a reversible residual network (RevNet); assumes a non-linear representation hypothesis
Alignment Function · concept · 0.781
A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
AI alignment · concept · 0.767
Field within which this work has implications for evaluating alignment progress.
Data-Centric Alignment · concept · 0.761
Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.
Alignment Type · concept · 0.760
The only statistically significant predictor of koan battery scores (p=0.006); includes Constitutional AI, RLHF, SFT, roleplay, empathy
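Of the neighbors listed above, the linear alignment map ϕ_lin is the closest relative of ϕ_id: since the identity matrix is orthogonal, ϕ_id is the special case of ϕ_lin with W_orth = I. The sketch below illustrates this with a 2×2 rotation matrix; the helpers `rotation`, `phi_lin`, and `phi_lin_inverse` are hypothetical names for illustration, not code from the source.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def matvec(W: Matrix, h: Sequence[float]) -> List[float]:
    """Matrix-vector product W·h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def transpose(W: Matrix) -> Matrix:
    return [list(col) for col in zip(*W)]


def rotation(theta: float) -> Matrix:
    """A 2x2 rotation matrix, which is orthogonal for any angle."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def phi_lin(W: Matrix, h: Sequence[float]) -> List[float]:
    """Linear alignment map: rotate activations into the latent basis."""
    return matvec(W, h)


def phi_lin_inverse(W: Matrix, z: Sequence[float]) -> List[float]:
    """For an orthogonal W, the inverse is the transpose."""
    return matvec(transpose(W), z)


h = [0.3, -1.2]                      # toy activations (made up)

# Zero rotation gives the identity matrix, so phi_lin reduces to phi_id.
same = phi_lin(rotation(0.0), h)

# A non-trivial rotation mixes neurons but remains invertible.
W = rotation(math.pi / 4)
round_trip = phi_lin_inverse(W, phi_lin(W, h))
```

The round trip recovers `h` up to floating-point error, which reflects the bijectivity that the Alignment Map (ϕ) entry requires of any alignment map.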