Algorithm 1: Finding Localist Alignment Matrix

Algorithm that extracts a localist (axis-aligned) approximation from any learned orthogonal rotation matrix for baseline comparison.
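
The idea of snapping a learned rotation to its nearest axis-aligned counterpart can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure from the source: it treats "axis-aligned orthogonal matrix" as a signed permutation matrix, measures closeness in Frobenius norm, and takes rows to index neurons and columns to index rotated directions. The function names `closest_signed_permutation` and `frobenius_distance` and the 3x3 example matrix are hypothetical. The search is brute force over permutations and only practical for tiny matrices. At realistic sizes a linear assignment solver such as `scipy.optimize.linear_sum_assignment` would take its place.

```python
import math
from itertools import permutations


def closest_signed_permutation(R):
    """Return the signed permutation matrix nearest to R in Frobenius norm.

    Assumption: R is a square orthogonal matrix given as a list of rows.
    Brute force over all n! permutations, so only suitable for tiny n.
    """
    n = len(R)
    # For a signed permutation P with entries s_i at (i, pi(i)):
    #   ||R - P||_F^2 = ||R||_F^2 + n - 2 * sum_i s_i * R[i][pi(i)]
    # so the nearest P maximizes sum_i |R[i][pi(i)]| with s_i = sign(R[i][pi(i)]).
    best = max(
        permutations(range(n)),
        key=lambda pi: sum(abs(R[i][pi[i]]) for i in range(n)),
    )
    P = [[0.0] * n for _ in range(n)]
    for i, j in enumerate(best):
        P[i][j] = 1.0 if R[i][j] >= 0 else -1.0
    return P


def frobenius_distance(A, B):
    """Frobenius norm of A - B for two equally sized matrices."""
    return math.sqrt(
        sum((a - b) ** 2 for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))
    )


# Hypothetical learned rotation: 1.2 rad in the plane of axes 0 and 1, axis 2 fixed.
theta = 1.2
R = [
    [math.cos(theta), -math.sin(theta), 0.0],
    [math.sin(theta), math.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
]

P = closest_signed_permutation(R)
for row in P:
    print(row)
print("distance to R:", frobenius_distance(R, P))
```

In this example the rotation angle is closer to a quarter turn than to zero, so the nearest axis-aligned matrix swaps axes 0 and 1 with one sign flip rather than leaving them in place. Under the disjoint-groups assumption, the neurons assigned to the columns belonging to one high-level variable would form that variable's group. How columns map to variables is not specified in the source.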

Neighborhood — ranked by edge-count

method

Localist Alignment Baseline
implements
Baseline that finds the axis-aligned orthogonal matrix closest to the learned distributed rotation, assuming disjoint neuron groups.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Alignment · concept · 0.777
The goal of making model behavior match human values and intentions, often addressed during post-training.
Alignment Problem · concept · 0.768
The problem of ensuring AI systems adopt values compatible with human welfare, argued to be a perennial problem already present in child-rearing.
Best localist alignment achieves IIA of 0.73 on hierarchical equality Both Equality Relations in Layer 1 · finding · 0.764
Shows localist alignment fails to capture the distributed structure found by DAS.
Ai Alignment Problem · concept · 0.759
AI alignment · concept · 0.755
Field within which this work has implications for evaluating alignment progress.
Representational Alignment · concept · 0.751
Measure of similarity between the similarity structures (kernels) induced by two different representations.
Alignment-Faking Reasoning Classifier · method · 0.744
LLM-based classifier prompted to detect alignment-faking reasoning in model scratchpads.
Alignment Between High-Level and Low-Level Models · concept · 0.743
A mapping assigning to each high-level variable a set of low-level variables and a function from low-level to high-level values.