finding

active

finding:increasing-caption-density-from-5-to-30-words-monotonically-improves-language-vision-alignment-scores-across-all-vision-model-families

Increasing caption density (from ~5 to ~30 words) monotonically improves language-vision alignment scores across all vision model families

Supports the claim that the information content of a modality pairing determines its alignment level

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Claims (2)

claim

The mathematical argument for cross-modal convergence strictly holds only for bijective projections of the underlying world
supports
Key limitation of the formal PRH derivation: lossy or stochastic observation functions weaken the convergence guarantee
The more descriptive (higher information) a caption is, the better its LLM representation aligns with the visual representation of the corresponding image
supports
Preliminary test of the information-level limitation of PRH: denser captions yield higher cross-modal alignment

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Increasing caption length from ~5 words to ~30 words in DCI dataset improves average language-vision alignment scores across all evaluated model pairs · finding · 0.879
Tests information-level cap on cross-modal alignment
Higher information (denser) captions should yield higher language-vision alignment scores · hypothesis · 0.863
Tests the information-level cap on cross-modal alignment
Alignment with vision models corresponds to improved performance on downstream language tasks including commonsense reasoning and math · claim · 0.791
Claims that alignment score is a proxy for general capability
Cross-modal language-vision alignment reaches a maximum of approximately 0.16 on mutual nearest-neighbor metric in Figure 3, well below the theoretical maximum of 1 · finding · 0.784
Quantitative bound on observed alignment; raises the open question of whether this gap reflects noise or real misalignment
CLIP models exhibit higher language-vision alignment than supervised or self-supervised vision models, but this alignment decreases after fine-tuning on ImageNet classification · finding · 0.781
CLIP training paradigm finding in cross-modal alignment
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa; linear relationship between language modeling score and vision-language alignment · finding · 0.766
Core cross-modal empirical result: larger and better language models align better with vision models
Training on image data should improve LLM performance, and training on language data should improve vision model performance · hypothesis · 0.753
Implication of PRH for cross-modal training efficiency
Multimodal-CoT with vision features achieves higher validation accuracy at early training epochs (epoch 1-3) compared to one-stage and two-stage language-only baselines on ScienceQA · finding · 0.749
Evidence that multimodal information accelerates convergence speed during training
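
Illustrative sketch (not from the source paper's code)

Several entries above report alignment on a mutual nearest-neighbor metric. The sketch below shows one way such a score could be computed for paired image and caption embeddings: for each sample, find its k nearest neighbors in each representation space and average the fraction of neighbors the two spaces share. The function names, the use of cosine similarity, and the toy vectors are assumptions made for illustration; the paper's exact procedure may differ.

```python
import math


def _cosine_knn(features, k):
    """For each row, return the set of indices of its k most similar rows.

    Similarity is cosine similarity (an assumption of this sketch).
    Rows must be non-zero vectors; a row is never its own neighbor.
    """
    norms = [math.sqrt(sum(x * x for x in row)) for row in features]
    neighbors = []
    for i, row_i in enumerate(features):
        sims = []
        for j, row_j in enumerate(features):
            if i == j:
                continue
            dot = sum(a * b for a, b in zip(row_i, row_j))
            sims.append((dot / (norms[i] * norms[j]), j))
        # Highest similarity first; ties go to the lower index.
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors.append({j for _, j in sims[:k]})
    return neighbors


def mutual_knn_alignment(feats_a, feats_b, k):
    """Mean fraction of shared k-nearest neighbors between two spaces.

    feats_a[i] and feats_b[i] must describe the same underlying sample,
    e.g. an image embedding and the embedding of its caption.
    Returns a value in [0, 1]; 1 means identical neighborhoods.
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both feature lists must cover the same samples")
    if not 0 < k < len(feats_a):
        raise ValueError("k must satisfy 0 < k < number of samples")
    nn_a = _cosine_knn(feats_a, k)
    nn_b = _cosine_knn(feats_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)


# Toy data (hypothetical): four samples, 2-D embeddings.
image_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Same geometry at a different scale: neighborhoods are unchanged.
text_feats_same = [[2.0, 0.0], [1.8, 0.2], [0.0, 2.0], [0.2, 1.8]]
# Samples 1 and 2 swapped: every nearest neighbor changes.
text_feats_swapped = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

score_same = mutual_knn_alignment(image_feats, text_feats_same, k=1)
score_swapped = mutual_knn_alignment(image_feats, text_feats_swapped, k=1)
```

Under this sketch, testing the caption-density finding would amount to embedding the same images with captions of increasing length and checking whether the score rises with each step. The quadratic neighbor search is only suitable for small samples.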