finding
active
finding:increasing-caption-length-from-5-words-to-30-words-in-dci-dataset-improves-average-language-vision-alignment-scores-across-all-evaluated-model-pairs
Increasing caption length from ~5 words to ~30 words in DCI dataset improves average language-vision alignment scores across all evaluated model pairs
Tests information-level cap on cross-modal alignment
Source paper
extracted_from · (2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola
Neighborhood — ranked by edge-count
Claims (1)
claim
- Key limitation of the PRH for non-bijective observations
Hypotheses (1)
hypothesis
- Tests the information-level cap on cross-modal alignment
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. They are candidates for new edges or unrecognized duplicates.
- Supports the claim that information content of modality pairing determines alignment level
- Quantitative bound on observed alignment; raises the open question of whether this gap reflects noise or real misalignment
- Core cross-modal empirical result: larger and better language models align better with vision models
- CLIP training paradigm finding in cross-modal alignment
- Claims that alignment score is a proxy for general capability
- Preliminary test of the information-level limitation of PRH; denser captions = higher cross-modal alignment
- Case study confirming that PMI-based learning in different modalities recovers the same perceptual representation
- Training on image data should improve LLM performance, and training on language data should improve vision model performance · hypothesis · 0.731 · Implication of PRH for cross-modal training efficiency
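
The "related by similarity" list above can be read as a filter over entity embeddings: keep entities whose cosine similarity to this finding is at least 0.65 and that have no typed edge to it. The sketch below is only an illustration of that rule under stated assumptions. The function names, the dictionary-of-vectors layout, the edge set, and the toy entity ids are all hypothetical, and it is not the tool's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to `target` but lacking a typed edge.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of (id, id) pairs that already have a typed relation
    threshold:   0.65 mirrors the "cosine >= 0.65" rule stated in the list header
    """
    hits = []
    for other, vec in embeddings.items():
        if other == target:
            continue
        # Skip pairs that already carry a typed relation in either direction.
        if (target, other) in typed_edges or (other, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((other, score))
    # Highest similarity first, matching a ranked list.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data, invented purely for illustration.
embeddings = {
    "finding": [1.0, 0.0],
    "hyp_a": [0.8, 0.6],    # similar, no typed edge: kept
    "claim_b": [1.0, 0.1],  # similar, but already has a typed edge: skipped
    "far_c": [0.0, 1.0],    # orthogonal: below threshold
}
typed_edges = {("finding", "claim_b")}
candidates = untyped_neighbors("finding", embeddings, typed_edges)
```

With this toy data only `hyp_a` survives the filter. Entities returned this way would then be reviewed by hand as candidate edges or possible duplicates.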