finding

active

finding:cross-modal-language-vision-alignment-reaches-a-maximum-of-approximately-0-16-on-mutual-nearest-neighbor-metric-in-figure-3-well-below-the-theoretical-maximum-of-1

Cross-modal language-vision alignment reaches a maximum of approximately 0.16 on the mutual nearest-neighbor metric in Figure 3, well below the theoretical maximum of 1

Quantitative bound on observed alignment; raises the open question of whether this gap reflects noise or real misalignment
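
To make the scale of the score concrete, here is a minimal sketch of a mutual k-nearest-neighbor alignment score. It is not the authors' implementation. It assumes cosine similarity over paired samples (row i of each feature list describes the same underlying item) and excludes each sample from its own neighbor set. The names `mutual_knn_alignment`, `feats_a`, and `feats_b` are hypothetical.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def _knn_indices(feats, k):
    """For each row, return the set of indices of its k most cosine-similar other rows."""
    unit = [_normalize(v) for v in feats]
    neighbors = []
    for i, u in enumerate(unit):
        sims = [
            (sum(a * b for a, b in zip(u, w)), j)
            for j, w in enumerate(unit)
            if j != i  # a sample is never counted as its own neighbor
        ]
        # Sort by descending similarity; ties go to the lower index.
        sims.sort(key=lambda t: (-t[0], t[1]))
        neighbors.append({j for _, j in sims[:k]})
    return neighbors


def mutual_knn_alignment(feats_a, feats_b, k):
    """Average fraction of k nearest neighbors shared between two feature spaces.

    feats_a and feats_b are lists of vectors for the same n paired samples;
    the two spaces may have different dimensionality. The result lies in [0, 1].
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both representations must cover the same samples")
    n = len(feats_a)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of samples")
    nn_a = _knn_indices(feats_a, k)
    nn_b = _knn_indices(feats_b, k)
    shared = sum(len(a & b) for a, b in zip(nn_a, nn_b))
    return shared / (n * k)


# Toy data: two spaces that pair up the four samples differently.
vision = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
language = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
same_space_score = mutual_knn_alignment(vision, vision, k=1)
cross_space_score = mutual_knn_alignment(vision, language, k=1)
```

Under this definition, a score of 0.16 means that, on average, about 16% of each sample's k nearest neighbors in one space are also among its k nearest neighbors in the other. A score of 1 requires the two neighbor sets to match for every sample.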

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Questions (1)

question

Is a mutual nearest-neighbor alignment score of 0.16 indicative of strong alignment, with the remaining gap being noise, or does it signify poor alignment, with major differences left to explain?
gates
Open question the authors leave unresolved about interpreting the magnitude of their alignment measurements

Findings (1)

finding

As the number of nearest neighbors k decreases in the CKNNA metric, the cross-modal alignment trend becomes more pronounced across both models and tasks
supports
Shows cross-modal alignment is primarily local rather than global

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Increasing caption density (from ~5 to ~30 words) monotonically improves language-vision alignment scores across all vision model families · finding · 0.784
Supports the claim that information content of modality pairing determines alignment level
Alignment with vision models corresponds to improved performance on downstream language tasks including commonsense reasoning and math · claim · 0.780
Claims that alignment score is a proxy for general capability
Spearman's rank correlation among different alignment metrics (CKA, SVCCA, Mutual k-NN, CKNNA) over 78 vision models is high across variants, with all p-values below 2.24×10^-105 · finding · 0.775
Validates robustness of alignment metric choice
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: language modeling score and vision-language alignment are linearly related · finding · 0.773
Core cross-modal empirical result: larger and better language models align better with vision models
LLM performance (measured as 1 − bits-per-byte on OpenWebText) is linearly related to alignment with vision models, measured via mutual nearest-neighbor on WIT · finding · 0.766
Key cross-modal alignment result
Among 78 vision models, those solving more VTAB tasks (higher transfer performance) show higher mutual nearest-neighbor alignment with each other · finding · 0.765
Key empirical finding establishing that representational alignment correlates with model competence
Increasing caption length from ~5 words to ~30 words in the DCI dataset improves average language-vision alignment scores across all evaluated model pairs · finding · 0.760
Tests information-level cap on cross-modal alignment
The mathematical argument for cross-modal convergence strictly holds only for bijective projections of the underlying world · claim · 0.755
Key limitation of the formal PRH derivation: lossy or stochastic observation functions weaken the convergence guarantee