Brute-Force Alignment Search

Baseline method that exhaustively searches discrete spaces of localist alignments between high-level variables and neuron groups.
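A minimal sketch of what exhaustive search over a discrete alignment space can look like, assuming each high-level variable is assigned one distinct, predefined neuron group. The names `brute_force_alignment_search`, `score_fn`, and `toy_score` are hypothetical, and the toy scorer is only a placeholder: this page does not say how the baseline scores a candidate alignment.

```python
from itertools import permutations
from typing import Callable, Dict, Optional, Sequence, Tuple

NeuronGroup = Tuple[int, ...]
Alignment = Dict[str, NeuronGroup]


def brute_force_alignment_search(
    variables: Sequence[str],
    neuron_groups: Sequence[NeuronGroup],
    score_fn: Callable[[Alignment], float],
) -> Tuple[Optional[Alignment], float]:
    """Try every assignment of distinct neuron groups to variables.

    Returns the highest-scoring alignment and its score. Ties keep the
    first candidate encountered.
    """
    best_alignment: Optional[Alignment] = None
    best_score = float("-inf")
    # permutations() yields every ordered choice of len(variables)
    # distinct groups, i.e. n! / (n - k)! candidate alignments.
    for assignment in permutations(neuron_groups, len(variables)):
        alignment = dict(zip(variables, assignment))
        score = score_fn(alignment)
        if score > best_score:
            best_alignment, best_score = alignment, score
    return best_alignment, best_score


# Toy setup (assumption): a hidden "true" alignment that the placeholder
# scorer rewards; a real scorer would evaluate the model itself.
variables = ["V1", "V2"]
neuron_groups = [(0, 1), (2, 3), (4, 5)]
true_alignment = {"V1": (2, 3), "V2": (0, 1)}


def toy_score(alignment: Alignment) -> float:
    """Fraction of variables mapped to their 'true' neuron group."""
    hits = sum(alignment[v] == true_alignment[v] for v in alignment)
    return hits / len(alignment)


best, best_score = brute_force_alignment_search(variables, neuron_groups, toy_score)
print(best, best_score)
```

The candidate count grows as n! / (n - k)! for n groups and k variables, so this kind of search is only practical for small discrete spaces.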

Neighborhood — ranked by edge-count

concept

Causal abstraction
implements
A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs

method

Distributed Alignment Search
extends
The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Alignment · concept · 0.813
The goal of making model behavior match human values and intentions, often addressed during post-training.
Alignment Function (AF) · method · 0.802
Learnable invertible transformation in DAS/MAS that rotates latent vectors into aligned subspaces; narrowed to orthogonal matrices Q.
AI alignment · concept · 0.791
Field within which this work has implications for evaluating alignment progress.
Alignment Function · concept · 0.789
A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
DAS finds better alignments than brute-force search by using gradient descent rather than exhaustive discrete search · claim · 0.785
Second central claim of the paper.
Alignment Problem · concept · 0.776
The problem of ensuring AI systems adopt values compatible with human welfare — argued to be a perennial problem already present in child-rearing
Ai Alignment Problem · concept · 0.773
Alignment Type · concept · 0.772
The only statistically significant predictor of koan battery scores (p=0.006); includes Constitutional AI, RLHF, SFT, roleplay, empathy