method
active
method:brute-force-alignment-search
Brute-Force Alignment Search
Baseline method that exhaustively searches discrete spaces of localist alignments between high-level variables and neuron groups.
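A minimal sketch of what such an exhaustive search could look like, under stated assumptions. The candidate neuron groups (contiguous, non-overlapping blocks), the function names, and the `score` callback are all hypothetical choices made for illustration and are not taken from the source. In practice the score would measure how well an alignment holds up under interventions on the model; here it is a toy stand-in.

```python
from itertools import permutations
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Group = Tuple[int, ...]
Alignment = Dict[str, Group]


def candidate_groups(n_neurons: int, group_size: int) -> List[Group]:
    """Contiguous, non-overlapping blocks of neuron indices.

    Assumption: restricting candidates to fixed-size blocks is only one
    way to discretize the search space.
    """
    return [
        tuple(range(start, start + group_size))
        for start in range(0, n_neurons - group_size + 1, group_size)
    ]


def brute_force_alignment_search(
    variables: Sequence[str],
    groups: Sequence[Group],
    score: Callable[[Alignment], float],
) -> Tuple[Optional[Alignment], float]:
    """Try every assignment of distinct groups to variables; keep the best."""
    best_alignment: Optional[Alignment] = None
    best_score = float("-inf")
    # permutations() yields ordered selections of distinct groups, so each
    # high-level variable is mapped to its own (localist) neuron group.
    for chosen in permutations(groups, len(variables)):
        alignment = dict(zip(variables, chosen))
        current = score(alignment)
        if current > best_score:
            best_alignment, best_score = alignment, current
    return best_alignment, best_score


# Toy usage: a hidden "true" alignment stands in for a real evaluation.
hidden = {"X": (4, 5), "Y": (0, 1)}


def toy_score(alignment: Alignment) -> float:
    # Fraction of variables assigned to their hidden group (hypothetical).
    return sum(alignment[v] == hidden[v] for v in hidden) / len(hidden)


groups = candidate_groups(n_neurons=8, group_size=2)
best, best_value = brute_force_alignment_search(["X", "Y"], groups, toy_score)
print(best, best_value)
```

The number of candidates grows factorially with the number of groups and variables, which is why a search like this serves as a baseline rather than a scalable method.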
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Causal abstraction · implements · A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs
Methods (1)
method
- Distributed Alignment Search · extends · The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The goal of making model behavior match human values and intentions, often addressed during post-training.
- Learnable invertible transformation in DAS/MAS that rotates latent vectors into aligned subspaces; narrowed to orthogonal matrices Q.
- Field within which this work has implications for evaluating alignment progress.
- A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
- Second central claim of the paper.
- The problem of ensuring AI systems adopt values compatible with human welfare — argued to be a perennial problem already present in child-rearing
- The only statistically significant predictor of koan battery scores (p=0.006); includes Constitutional AI, RLHF, SFT, roleplay, empathy