hypothesis

active

hypothesis:higher-information-denser-captions-should-yield-higher-language-vision-alignment-scores

Higher information (denser) captions should yield higher language-vision alignment scores

Tests the information-level cap on cross-modal alignment

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Findings (1)

finding

Increasing caption length from ~5 words to ~30 words in the DCI dataset improves average language-vision alignment scores across all evaluated model pairs
supports
Tests the information-level cap on cross-modal alignment
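
The alignment score behind this finding is a mutual nearest-neighbor metric. A minimal sketch of that kind of score follows, assuming cosine similarity for the neighbor search and using random toy features in place of real model embeddings. The function names `cosine_knn` and `mutual_knn_alignment`, the choice of `k`, and the toy data are illustrative assumptions, not the source paper's implementation.

```python
import math
import random


def cosine_knn(features, k):
    """For each row, return the set of indices of its k most cosine-similar rows."""
    normed = []
    for row in features:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        normed.append([x / norm for x in row])
    neighbors = []
    for i, a in enumerate(normed):
        sims = [
            (sum(x * y for x, y in zip(a, b)), j)
            for j, b in enumerate(normed)
            if j != i  # a sample is never its own neighbor
        ]
        sims.sort(reverse=True)
        neighbors.append({j for _, j in sims[:k]})
    return neighbors


def mutual_knn_alignment(feats_a, feats_b, k):
    """Mean fraction of k nearest neighbors shared by two representation spaces.

    Row i of feats_a and row i of feats_b must describe the same sample,
    e.g. an image embedding and the embedding of its caption.
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both feature lists must describe the same samples")
    nn_a = cosine_knn(feats_a, k)
    nn_b = cosine_knn(feats_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)


# Toy data only (assumption): 20 samples with 8-dimensional features.
rng = random.Random(0)
vision = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
scaled = [[2.0 * x for x in row] for row in vision]
unrelated = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]

print(mutual_knn_alignment(vision, scaled, k=5))     # same geometry up to scale
print(mutual_knn_alignment(vision, unrelated, k=5))  # independent random features
```

To probe the hypothesis with a score like this, one would compute it once per caption-length bucket, holding the images and models fixed, and compare the resulting values.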

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Increasing caption density (from ~5 to ~30 words) monotonically improves language-vision alignment scores across all vision model families · finding · 0.863
Supports the claim that information content of modality pairing determines alignment level
The more descriptive (higher information) a caption is, the better its LLM representation aligns with the visual representation of the corresponding image · claim · 0.831
Preliminary test of the information-level limitation of PRH; denser captions = higher cross-modal alignment
Alignment with vision models corresponds to improved performance on downstream language tasks including commonsense reasoning and math · claim · 0.798
Claims that alignment score is a proxy for general capability
CLIP models exhibit higher language-vision alignment than supervised or self-supervised vision models, but this alignment decreases after fine-tuning on ImageNet classification · finding · 0.793
CLIP training paradigm finding in cross-modal alignment
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: linear relationship between language modeling score and vision-language alignment · finding · 0.768
Core cross-modal empirical result: larger and better language models align better with vision models
Cross-modal language-vision alignment reaches a maximum of approximately 0.16 on mutual nearest-neighbor metric in Figure 3, well below the theoretical maximum of 1 · finding · 0.753
Quantitative bound on observed alignment; raises the open question of whether this gap reflects noise or real misalignment
Higher-activating feature intervals are systematically more interpretable than lower-activating intervals in human analysis · finding · 0.752
Shows interpretability correlates with activation strength, most model effect comes from high activations
Glaese et al. 2022: Improving alignment of dialogue agents via targeted human judgements · concept · 0.750
Alignment paper cited as example of RLHF fine-tuning technique; ref 19