method
active
method:algorithm-1-finding-localist-alignment-matrix
Algorithm 1: Finding Localist Alignment Matrix
Algorithm that extracts a localist (axis-aligned) approximation from any learned orthogonal rotation matrix for baseline comparison.
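As a rough illustration of what "localist approximation" can mean here, the sketch below searches for the signed permutation matrix nearest to a given orthogonal matrix in Frobenius norm. This is an assumption-laden stand-in, not the source's Algorithm 1: the function name `closest_signed_permutation` is hypothetical, each neuron group is taken to be a single neuron, and the search is brute force, so it is only practical for very small matrices. It relies on the fact that minimizing the Frobenius distance to a signed permutation is equivalent to maximizing the sum of absolute values of the selected entries.

```python
import math
from itertools import permutations


def closest_signed_permutation(rotation):
    """Return the signed permutation matrix nearest to `rotation` (Frobenius norm).

    Hypothetical helper for illustration only; brute force over all n!
    permutations, so keep n small.
    """
    n = len(rotation)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        # Picking entry (i, perm[i]) in every row; larger total magnitude
        # means a smaller Frobenius distance to the resulting matrix.
        score = sum(abs(rotation[i][perm[i]]) for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score

    # Build the axis-aligned matrix: one nonzero entry (+1 or -1) per row
    # and per column, with the sign copied from the learned rotation.
    localist = [[0.0] * n for _ in range(n)]
    for i, j in enumerate(best_perm):
        localist[i][j] = 1.0 if rotation[i][j] >= 0 else -1.0
    return localist


# Usage: a 2x2 rotation by a small angle is closest to the identity.
theta = 0.3
small_rotation = [
    [math.cos(theta), -math.sin(theta)],
    [math.sin(theta), math.cos(theta)],
]
near_identity = closest_signed_permutation(small_rotation)
```

For larger matrices the same objective is an assignment problem, so a polynomial-time solver would be the natural replacement for the permutation loop. Handling neuron groups of more than one dimension would need a different formulation than the one shown.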
Neighborhood — ranked by edge-count
Methods (1)
- Localist Alignment Baseline (method, implements): Baseline that finds the axis-aligned orthogonal matrix closest to the learned distributed rotation, assuming disjoint neuron groups.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- The goal of making model behavior match human values and intentions, often addressed during post-training.
- The problem of ensuring AI systems adopt values compatible with human welfare — argued to be a perennial problem already present in child-rearing
- Best localist alignment achieves IIA of 0.73 on hierarchical equality Both Equality Relations in Layer 1 (finding, 0.764): Shows localist alignment fails to capture the distributed structure found by DAS.
- Field within which this work has implications for evaluating alignment progress.
- Measure of similarity between the similarity structures (kernels) induced by two different representations
- LLM-based classifier prompted to detect alignment-faking reasoning in model scratchpads
- A mapping assigning to each high-level variable a set of low-level variables and a function from low-level to high-level values.