finding
active
finding:cross-modal-language-vision-alignment-reaches-a-maximum-of-approximately-0-16-on-mutual-nearest-neighbor-metric-in-figure-3-well-below-the-theoretical-maximum-of-1
Cross-modal language-vision alignment reaches a maximum of approximately 0.16 on the mutual nearest-neighbor metric in Figure 3, well below the theoretical maximum of 1
Quantitative bound on observed alignment; raises the open question of whether this gap reflects noise or real misalignment
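For orientation, here is a minimal sketch of a mutual nearest-neighbor alignment score. It assumes the metric is the average fraction of k-nearest neighbors that two representation spaces share over the same paired samples. The function names, the use of cosine similarity, and the default `k` are illustrative assumptions, not the source paper's implementation.

```python
import numpy as np

def knn_indices(feats, k):
    """Return the indices of the k nearest neighbors of each row (cosine similarity)."""
    # L2-normalize rows so the inner product equals cosine similarity.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    # Exclude each sample from its own neighbor list.
    np.fill_diagonal(sim, -np.inf)
    return np.argsort(-sim, axis=1)[:, :k]

def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Average overlap of k-NN sets between two feature matrices.

    Row i of feats_a and row i of feats_b are assumed to describe the same
    underlying sample (e.g. an image and its caption). The score lies in [0, 1].
    """
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap))
```

Under this definition, two identical representations score 1, and a score near 0.16 would mean that, on average, roughly 16% of each sample's nearest neighbors are shared between the two spaces.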
Source paper
extracted_from (2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola
Neighborhood — ranked by edge-count
Questions (1)
question
- Open question the authors leave unresolved about interpreting the magnitude of their alignment measurements
Findings (1)
finding
- Shows cross-modal alignment is primarily local rather than global
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Supports the claim that information content of modality pairing determines alignment level
- Claims that alignment score is a proxy for general capability
- Validates robustness of alignment metric choice
- Core cross-modal empirical result: larger and better language models align better with vision models
- Key cross-modal alignment result
- Key empirical finding establishing that representational alignment correlates with model competence
- Tests information-level cap on cross-modal alignment
- Key limitation of the formal PRH derivation: lossy or stochastic observation functions weaken the convergence guarantee