finding
active
finding:clip-models-exhibit-higher-language-vision-alignment-than-supervised-or-self-supervised-vision-models-but-this-alignment-decreases-after-fine-tuning-on-imagenet-classification
CLIP models exhibit higher language-vision alignment than supervised or self-supervised vision models, but this alignment decreases after fine-tuning on ImageNet classification
CLIP training paradigm finding in cross-modal alignment
Source paper
extracted_from · (2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola
Neighborhood — ranked by edge-count
Claims (1)
claim
- Implication of PRH for training practice: both modalities point at the same underlying reality
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Claims that alignment score is a proxy for general capability
- Lenc & Vedaldi result illustrating data independence in representations and layer-wise alignment
- Core cross-modal empirical result: larger and better language models align better with vision models
- Higher information (denser) captions should yield higher language-vision alignment scores · hypothesis · 0.793 · Tests the information-level cap on cross-modal alignment
- Empirical finding supporting the Universality Hypothesis; extended by the paper to consciousness
- Supports the claim that information content of modality pairing determines alignment level
- OpenAI GPT-4V finding supporting cross-modal training benefit
- Empirical result showing alignment increases with model competence
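
The selection rule stated for the "Related by similarity" list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `cosine` and `related_by_similarity`, the representation of entity embeddings as plain lists of floats, and the representation of typed edges as a set of unordered ID pairs are all hypothetical and do not describe how this graph is actually built.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def related_by_similarity(entity_id, embeddings, typed_edges, threshold=0.65):
    """Return (other_id, score) pairs with cosine >= threshold and no typed edge.

    embeddings:  dict mapping entity ID -> list of floats (hypothetical layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs (hypothetical layout)
    Results are sorted by score, highest first.
    """
    query = embeddings[entity_id]
    candidates = []
    for other_id, vec in embeddings.items():
        if other_id == entity_id:
            continue  # skip the entity itself
        if frozenset((entity_id, other_id)) in typed_edges:
            continue  # already connected by a typed relation
        score = cosine(query, vec)
        if score >= threshold:
            candidates.append((other_id, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, an entity such as the 0.793-scored hypothesis in the list would be returned because it clears the threshold and has no typed relation to this finding, while anything already linked by a typed edge would be skipped regardless of its score.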