Localist Alignment Baseline

Baseline that finds the axis-aligned orthogonal matrix closest to the learned distributed rotation, assuming disjoint neuron groups.

Neighborhood — ranked by edge-count

method

Algorithm 1: Finding Localist Alignment Matrix
implements
Algorithm that extracts a localist (axis-aligned) approximation from any learned orthogonal rotation matrix for baseline comparison.
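A minimal sketch of what an axis-aligned approximation can look like, and not necessarily the procedure Algorithm 1 uses. It assumes the learned rotation is a square orthogonal matrix given as nested lists, and it snaps that matrix to a signed permutation matrix by greedily keeping the largest-magnitude entries, so each row and each column is used once. The function name `snap_to_signed_permutation` is hypothetical. The greedy choice is a simplification: an exact nearest permutation under the Frobenius norm would instead need an assignment solver such as the Hungarian algorithm.

```python
import math


def snap_to_signed_permutation(rotation):
    """Greedy sketch: snap a square orthogonal matrix to a signed permutation.

    Assumption (not from the source): entries are visited from largest to
    smallest magnitude, and an entry is kept only if its row and column are
    both still free. Kept entries are rounded to +1 or -1.
    """
    n = len(rotation)
    # Every (magnitude, row, col) triple, largest magnitude first.
    candidates = sorted(
        ((abs(rotation[i][j]), i, j) for i in range(n) for j in range(n)),
        reverse=True,
    )
    used_rows, used_cols = set(), set()
    result = [[0.0] * n for _ in range(n)]
    for _, i, j in candidates:
        if i in used_rows or j in used_cols:
            continue  # this row or column already has its axis assigned
        result[i][j] = 1.0 if rotation[i][j] >= 0 else -1.0
        used_rows.add(i)
        used_cols.add(j)
    return result


# Illustrative input: an axis swap perturbed by a small rotation angle.
theta = 0.2
learned = [
    [math.sin(theta), math.cos(theta)],
    [-math.cos(theta), math.sin(theta)],
]
localist = snap_to_signed_permutation(learned)
```

Because each output row and column holds exactly one nonzero entry of magnitude 1, the result is itself orthogonal and maps every input neuron to a single output axis, which is the sense in which it is localist.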

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Best localist alignment achieves IIA of 0.73 on hierarchical equality ("Both Equality Relations") in Layer 1 · finding · 0.793
Shows localist alignment fails to capture the distributed structure found by DAS.
Localist alignment achieves ~0.51 IIA on MoNLI tasks, near chance performance · finding · 0.793
Localist methods fail entirely on MoNLI distributed representations.
Localist Representations · concept · 0.759
Prior assumption that high-level variables align with disjoint groups of neurons in standard basis; contrasted with distributed representations.
Representational Alignment · concept · 0.729
Measure of similarity between the similarity structures (kernels) induced by two different representations
Alignment Between High-Level and Low-Level Models · concept · 0.723
A mapping assigning to each high-level variable a set of low-level variables and a function from low-level to high-level values.
Linear Alignment Map (ϕ_lin) · method · 0.719
Alignment map ϕ(h) = W_orth · h, where W_orth is an orthogonal matrix; assumes the linear representation hypothesis.
Centered Kernel Nearest-Neighbor Alignment · method · 0.714
Modified CKA metric that restricts cross-covariance to nearest neighbors; introduced in this paper's appendix
Alignment · concept · 0.703
The goal of making model behavior match human values and intentions, often addressed during post-training.