finding

active

finding:jointly-training-a-language-model-with-a-vision-model-improves-performance-on-language-tasks-compared-to-training-the-language-model-alone

Jointly training a language model with a vision model improves performance on language tasks compared to training the language model alone

OpenAI GPT-4V finding supporting cross-modal training benefit

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Hypotheses (1)

hypothesis

Training on image data should improve LLM performance, and training on language data should improve vision model performance
supports
Implication of PRH for cross-modal training efficiency

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Alignment with vision models corresponds to improved performance on downstream language tasks, including commonsense reasoning and math · claim · 0.818
Claims that alignment score is a proxy for general capability
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: a linear relationship between language modeling score and vision-language alignment · finding · 0.792
Core cross-modal empirical result: larger and better language models align better with vision models
Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.792
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
A vision model trained on ImageNet can be aligned with a model trained on Places-365 while maintaining good performance, and early layers are more interchangeable than later layers · finding · 0.781
Lenc & Vedaldi result illustrating data independence in representations and layer-wise alignment
CLIP models exhibit higher language-vision alignment than supervised or self-supervised vision models, but this alignment decreases after fine-tuning on ImageNet classification · finding · 0.775
CLIP training paradigm finding in cross-modal alignment
We hypothesize that intervention efficiency can be scaled with multi-node and multi-GPU training as language models grow larger · hypothesis · 0.771
Future work hypothesis about scaling pyvene's computational efficiency for very large models
Olah et al. (2020) found that automatically trained computer vision models, regardless of architecture and training procedure, all arrive at similar functional structures, organizing similar features into similar compositional hierarchies that closely resemble the primate visual cortex · finding · 0.770
Empirical finding supporting the Universality Hypothesis; extended by the paper to consciousness
Models trained directly with asynchronous updates would exhibit even greater robustness than synchronously trained models · hypothesis · 0.767
Hypothesis that motivated the asynchronous robustness comparison experiment
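
The "related by similarity" listing above can be read as a filter over entity embeddings: keep entities whose cosine similarity to this one is at least 0.65 and which share no typed edge with it, then rank by score. The following is a minimal sketch of that filter, not the actual implementation behind this page. The function names, the dictionary of embeddings, and the edge-pair format are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to target_id but not linked to it.

    embeddings:  dict mapping entity id -> list of floats (hypothetical format)
    typed_edges: iterable of (source_id, target_id) pairs (hypothetical format)
    """
    target_vec = embeddings[target_id]

    # Entities that already share a typed edge with the target, in either direction.
    linked = set()
    for src, dst in typed_edges:
        if src == target_id:
            linked.add(dst)
        elif dst == target_id:
            linked.add(src)

    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            hits.append((entity_id, score))

    # Highest similarity first, matching the order of the listing above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: "c" is similar but already linked, "b" is orthogonal, "a" qualifies.
toy_embeddings = {"t": [1.0, 0.0], "a": [1.0, 0.1], "b": [0.0, 1.0], "c": [1.0, 0.0]}
toy_edges = [("t", "c")]
candidates = related_by_similarity("t", toy_embeddings, toy_edges)
```

Under these assumptions, an entity that passes the filter is a candidate for a new typed edge or a possible duplicate, which is how the listing describes its own entries.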