hypothesis

active

hypothesis:training-on-image-data-should-improve-llm-performance-and-training-on-language-data-should-improve-vision-model-performance

Training on image data should improve LLM performance, and training on language data should improve vision model performance

Implication of the Platonic Representation Hypothesis (PRH) for cross-modal training efficiency

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Findings (1)

finding

Jointly training a language model with a vision model improves performance on language tasks compared to training the language model alone
supports
OpenAI GPT-4V finding supporting cross-modal training benefit

Claims (1)

claim

LLMs trained only on language data have rich knowledge of visual structures sufficient to train decent visual representations
supports
Supporting evidence for cross-modal platonic representation

Concepts (1)

concept

Molyneux's Problem
analogous_to
Philosophical thought experiment: could a person born blind, on gaining sight, recognize by sight alone shapes previously known only by touch? Used to illuminate cross-modal grounding

Quotes (1)

quote

Neural networks, trained with different objectives on different data and modalities, are converging to a shared statistical model of reality in their representation spaces.
extends
The paper's central thesis statement, presented prominently after the abstract

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM · finding · 0.872
Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: linear relationship between language modeling score and vision-language alignment · finding · 0.843
Core cross-modal empirical result: larger and better language models align better with vision models
Better LLMs (measured by 1 − bits-per-byte on OpenWebText) show a linear relationship with alignment to vision models, measured via mutual nearest-neighbor on WIT · finding · 0.786
Key cross-modal alignment result
A single linear projection is sufficient to stitch a vision model to an LLM and achieve good performance on visual question answering and image captioning · finding · 0.782
Merullo et al. result on cross-modal representational compatibility
Connecting the Dots: LLMs Can Infer and Verbalize Latent Structure from Disparate Training Data (Treutlein et al. 2024) · concept · 0.776
Out-of-context reasoning work directly related to synthetic document fine-tuning experiments
Because an LLM's training data contain many instances of the rogue AI trope, the danger is that life will imitate art, quite literally · claim · 0.776
Warning that fictional narratives in training data increase risk of agents enacting dangerous self-preserving roles
A vision model trained on ImageNet can be aligned with a model trained on Places-365 while maintaining good performance, and early layers are more interchangeable than later layers · finding · 0.775
Lenc & Vedaldi result illustrating data independence in representations and layer-wise alignment
Alignment with vision models corresponds to improved performance on downstream language tasks including commonsense reasoning and math · claim · 0.773
Claims that alignment score is a proxy for general capability
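
Several of the findings above rely on a mutual nearest-neighbor alignment score between a language model and a vision model. The sketch below is one plausible way to compute such a score, not the source paper's implementation: the function names, the default k, and the use of cosine similarity to rank neighbors are all assumptions made for illustration. It takes two lists of feature vectors in which row i of each list describes the same paired sample (for example, an image and its caption) and returns the average fraction of nearest neighbors that the two spaces share.

```python
import math
from typing import List, Sequence, Set

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def knn_sets(features: Sequence[Vector], k: int) -> List[Set[int]]:
    """For each sample, the index set of its k most similar other samples."""
    neighbors = []
    for i, u in enumerate(features):
        # Exclude the sample itself, then rank the rest by similarity.
        others = [j for j in range(len(features)) if j != i]
        others.sort(key=lambda j: cosine(u, features[j]), reverse=True)
        neighbors.append(set(others[:k]))
    return neighbors


def mutual_knn_alignment(
    feats_a: Sequence[Vector], feats_b: Sequence[Vector], k: int = 10
) -> float:
    """Average fraction of k-nearest neighbors shared by two feature spaces.

    Assumption: cosine similarity defines the neighborhoods; other distance
    choices are equally valid for a sketch like this.
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both feature sets must cover the same paired samples")
    if not 0 < k < len(feats_a):
        raise ValueError("k must be between 1 and the number of samples minus 1")
    nn_a = knn_sets(feats_a, k)
    nn_b = knn_sets(feats_b, k)
    # Per-sample overlap of the two neighbor sets, averaged over all samples.
    return sum(len(a & b) / k for a, b in zip(nn_a, nn_b)) / len(nn_a)
```

Under these assumptions the score lies between 0 and 1, and two identical feature sets score exactly 1. The two feature spaces may have different dimensionality, since neighbors are computed within each space separately and only the neighbor indices are compared.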