Representational Alignment

Measure of similarity between the similarity structures (kernels) induced by two different representations
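A minimal sketch of what "the kernel induced by a representation" means. It assumes each representation is an n × d NumPy array of embeddings for the same n datapoints, and that the kernel is cosine similarity; this is an illustrative choice, and the source does not fix one. The helper `cosine_kernel` and the random data are hypothetical.

```python
import numpy as np

def cosine_kernel(feats: np.ndarray) -> np.ndarray:
    """Return the n x n cosine-similarity kernel induced by an n x d representation."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed.T

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 64))   # hypothetical representation A (width 64)
Y = rng.normal(size=(n, 128))  # hypothetical representation B (width 128)

# Both kernels are n x n, even though the feature widths differ.
K_X, K_Y = cosine_kernel(X), cosine_kernel(Y)
print(K_X.shape, K_Y.shape)

# Rotating the feature space leaves the induced kernel unchanged,
# which is one reason alignment compares kernels rather than raw features.
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # random orthogonal matrix
print(np.allclose(cosine_kernel(X @ Q), K_X))
```

Because each kernel is indexed by datapoints rather than by features, two representations with different widths can still be compared entry by entry.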

Neighborhood (ranked by edge count)

method

Centered Kernel Alignment
implements
Standard alignment metric cited and compared against; measures global kernel similarity between representations
Mutual k-Nearest Neighbor Alignment Metric
implements
Primary alignment metric used in experiments; measures mean intersection of k-nearest neighbor sets between two kernels
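A minimal NumPy sketch of the two metrics listed above. It assumes linear (feature-space) CKA and cosine-similarity kernels for the mutual k-NN metric. The function names, the kernel choice, and the default `k` are illustrative assumptions, not the source's implementation. The mutual k-NN function follows the description above: for each datapoint, it takes the k nearest neighbors under each kernel and averages the fraction of neighbors the two sets share.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between n x d1 and n x d2 representations."""
    # Center each feature so the linear kernels X X^T and Y Y^T are centered.
    Xc = X - X.mean(axis=0, keepdims=True)
    Yc = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    norm_x = np.linalg.norm(Xc.T @ Xc, "fro")
    norm_y = np.linalg.norm(Yc.T @ Yc, "fro")
    return float(cross / (norm_x * norm_y))

def knn_indices(feats: np.ndarray, k: int) -> np.ndarray:
    """Indices of each row's k nearest neighbors under cosine similarity, excluding itself."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a point is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def mutual_knn_alignment(X: np.ndarray, Y: np.ndarray, k: int = 10) -> float:
    """Mean fraction of shared k-nearest neighbors between the two kernels."""
    nn_x, nn_y = knn_indices(X, k), knn_indices(Y, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_x, nn_y)]
    return float(np.mean(overlaps))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
Y_rotated = X @ Q                      # induces the same kernel as X
Y_random = rng.normal(size=(200, 48))  # unrelated representation

print(linear_cka(X, Y_rotated), mutual_knn_alignment(X, Y_rotated))  # both ~1.0
print(linear_cka(X, Y_random), mutual_knn_alignment(X, Y_random))    # both well below 1
```

The two metrics differ in scope. CKA summarizes the whole kernel in one global score. The mutual k-NN metric only checks whether local neighborhoods agree, so it ignores how distant points are arranged relative to each other.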

concept

Kernel (representational)
associated_with, implements
A function characterizing how a representation measures distance/similarity between datapoints; used to compare representations
Representational Convergence
associated_with
The central empirical phenomenon: different neural networks trained on different data/objectives develop increasingly similar representations
Cross-Modal Alignment
extends
The alignment between representations learned from different data modalities such as vision and language

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.

Alignment · concept · 0.847
The goal of making model behavior match human values and intentions, often addressed during post-training.
What is the appropriate metric for measuring representational alignment, given the active debate over the merits and deficiencies of all proposed measures? · question · 0.815
Open methodological question, acknowledged as a limitation
Alignment Function · concept · 0.809
A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
Alignment Problem · concept · 0.804
The problem of ensuring AI systems adopt values compatible with human welfare, argued to be a perennial problem already present in child-rearing
Deliberative Alignment · framework · 0.803
OpenAI's approach integrating chain-of-thought reasoning into alignment; parallels contemplative self-monitoring
AI alignment · concept · 0.793
Field within which this work has implications for evaluating alignment progress.
Data-Centric Alignment · concept · 0.791
Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.
Alignment Type · concept · 0.787
The only statistically significant predictor of koan battery scores (p = 0.006); includes Constitutional AI, RLHF, SFT, roleplay, and empathy