finding

active

finding:increasing-caption-length-from-5-words-to-30-words-in-dci-dataset-improves-average-language-vision-alignment-scores-across-all-evaluated-model-pairs

Increasing caption length from ~5 words to ~30 words in the DCI dataset improves average language-vision alignment scores across all evaluated model pairs

Tests the information-level cap on cross-modal alignment

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Claims (1)

claim

Different models cannot converge to the same representation if they have access to fundamentally different information; convergence is capped by mutual information between input signals
supports
Key limitation of the PRH for non-bijective observations

Hypotheses (1)

hypothesis

Higher information (denser) captions should yield higher language-vision alignment scores
supports
Tests the information-level cap on cross-modal alignment

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Increasing caption density (from ~5 to ~30 words) monotonically improves language-vision alignment scores across all vision model families · finding · 0.879
Supports the claim that information content of modality pairing determines alignment level
Cross-modal language-vision alignment reaches a maximum of approximately 0.16 on the mutual nearest-neighbor metric in Figure 3, well below the theoretical maximum of 1 · finding · 0.760
Quantitative bound on observed alignment; raises the open question of whether this gap reflects noise or real misalignment
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: there is a linear relationship between language modeling score and vision-language alignment · finding · 0.753
Core cross-modal empirical result: larger and better language models align better with vision models
CLIP models exhibit higher language-vision alignment than supervised or self-supervised vision models, but this alignment decreases after fine-tuning on ImageNet classification · finding · 0.753
CLIP training paradigm finding in cross-modal alignment
Alignment with vision models corresponds to improved performance on downstream language tasks including commonsense reasoning and math · claim · 0.750
Claims that alignment score is a proxy for general capability
The more descriptive (higher information) a caption is, the better its LLM representation aligns with the visual representation of the corresponding image · claim · 0.742
Preliminary test of the information-level limitation of PRH; denser captions = higher cross-modal alignment
Color distances learned from language cooccurrence statistics closely mirror those learned from image cooccurrence statistics and human perceptual distances (CIELAB) · finding · 0.739
Case study confirming that PMI-based learning in different modalities recovers the same perceptual representation
Training on image data should improve LLM performance, and training on language data should improve vision model performance · hypothesis · 0.731
Implication of PRH for cross-modal training efficiency
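
Illustrative sketch (not from the source paper)

Several entries above refer to alignment scores on a mutual nearest-neighbor metric with a theoretical maximum of 1. The sketch below shows one plausible reading of such a metric: for each sample, take its k nearest neighbors in each representation space and measure how much the two neighbor sets overlap. The function names, the use of cosine similarity, and the choice of k are assumptions made for illustration; the paper's exact procedure may differ.

```python
import math
from typing import List, Sequence, Set

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]


def knn_indices(features: Sequence[Vector], k: int) -> List[Set[int]]:
    """For each sample, return the index set of its k most similar other samples.

    Assumption: similarity is cosine similarity; ties are broken by lower index.
    """
    unit = [_normalize(v) for v in features]
    n = len(unit)
    neighbors: List[Set[int]] = []
    for i in range(n):
        sims = [
            (sum(a * b for a, b in zip(unit[i], unit[j])), j)
            for j in range(n)
            if j != i
        ]
        sims.sort(key=lambda t: (-t[0], t[1]))
        neighbors.append({j for _, j in sims[:k]})
    return neighbors


def mutual_knn_alignment(
    feats_a: Sequence[Vector], feats_b: Sequence[Vector], k: int
) -> float:
    """Average fraction of shared k-nearest neighbors across two feature spaces.

    Row i of feats_a and row i of feats_b must describe the same underlying
    sample (for example, a caption embedding and its image embedding).
    Returns a value in [0, 1]; 1 means identical neighbor sets everywhere.
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both feature sets must cover the same samples")
    if not 1 <= k < len(feats_a):
        raise ValueError("k must satisfy 1 <= k < number of samples")
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlap = sum(len(a & b) for a, b in zip(nn_a, nn_b))
    return overlap / (k * len(feats_a))
```

Under this reading, comparing caption embeddings for short and long captions against the same image embeddings would amount to calling `mutual_knn_alignment` once per caption length and comparing the two scores. The score depends only on neighbor rankings, so the two spaces may have different dimensionality and scale.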