finding

active

finding:clip-models-exhibit-higher-language-vision-alignment-than-supervised-or-self-supervised-vision-models-but-this-alignment-decreases-after-fine-tuning-on-imagenet-classification

CLIP models exhibit higher language-vision alignment than supervised or self-supervised vision models, but this alignment decreases after fine-tuning on ImageNet classification

CLIP training paradigm finding in cross-modal alignment

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Claims (1)

claim

If there is a modality-agnostic platonic representation, training on both image and language data should improve the best model in either modality
supports
Implication of PRH for training practice: both modalities point at the same underlying reality

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Alignment with vision models corresponds to improved performance on downstream language tasks including commonsense reasoning and math · claim · 0.820
Claims that alignment score is a proxy for general capability
A vision model trained on ImageNet can be aligned with a model trained on Places-365 while maintaining good performance, and early layers are more interchangeable than later layers · finding · 0.811
Lenc & Vedaldi result illustrating data independence in representations and layer-wise alignment
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: the relationship between language modeling score and vision-language alignment is linear · finding · 0.793
Core cross-modal empirical result: larger and better language models align better with vision models
Higher-information (denser) captions should yield higher language-vision alignment scores · hypothesis · 0.793
Tests the information-level cap on cross-modal alignment
Olah et al. (2020) found that automatically trained computer vision models, regardless of architecture and training procedure, all arrive at similar functional structures organizing similar features into similar compositional hierarchies, closely resembling the primate visual cortex. · finding · 0.781
Empirical finding supporting the Universality Hypothesis; extended by the paper to consciousness
Increasing caption density (from ~5 to ~30 words) monotonically improves language-vision alignment scores across all vision model families · finding · 0.781
Supports the claim that information content of modality pairing determines alignment level
Jointly training a language model with a vision model improves performance on language tasks compared to training the language model alone · finding · 0.775
OpenAI GPT-4V finding supporting cross-modal training benefit
Among 78 vision models on Places-365, models that solve more VTAB tasks tend to be more aligned with each other, with high-performance models forming a tightly clustered set · finding · 0.771
Empirical result showing alignment increases with model competence
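
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as pairs of entity names. The function name `related_by_similarity`, the data layout, and the toy vectors are hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs with cosine >= threshold and no typed edge to target.

    embeddings:  dict mapping entity name -> list of floats (assumed layout)
    typed_edges: iterable of (entity_a, entity_b) pairs, treated as undirected
    """
    # Entities that already share a typed edge with the target are excluded.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))

    # Highest similarity first, matching the descending scores in the list above.
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy data: "c" is similar but already linked, "b" is orthogonal to the target.
toy_embeddings = {"t": [1.0, 0.0], "a": [1.0, 0.1], "b": [0.0, 1.0], "c": [1.0, 0.0]}
toy_edges = [("t", "c")]
candidates = related_by_similarity("t", toy_embeddings, toy_edges)
```

In this toy setup only `"a"` is returned: `"c"` is filtered out by its existing typed edge and `"b"` falls below the threshold. Entities surfaced this way are the candidates for new edges or duplicate checks described above.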