finding

active

finding:the-better-an-llm-is-at-language-modeling-the-more-it-aligns-with-vision-models-and-vice-versa-linear-relationship-between-language-modeling-score-and-vision-language-alignment

The better an LLM is at language modeling, the more it aligns with vision models, and vice versa — linear relationship between language modeling score and vision-language alignment

Core cross-modal empirical result: larger and better language models align better with vision models
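The alignment side of this relationship is described in a related finding on this page as a mutual nearest-neighbor measure. As a rough illustration only, and not the authors' implementation, the sketch below scores two embedding sets of the same paired samples by how many k-nearest neighbors each sample shares across the two spaces. The function names, the use of cosine similarity, and the value of k are assumptions made for this example.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def knn_indices(feats, k):
    """For each row, return the index set of its k most cosine-similar other rows."""
    unit = [_normalize(v) for v in feats]
    neighbors = []
    for i, u in enumerate(unit):
        sims = [
            (sum(a * b for a, b in zip(u, w)), j)
            for j, w in enumerate(unit)
            if j != i  # a sample is never counted as its own neighbor
        ]
        # Highest similarity first; ties are broken by the lower index.
        sims.sort(key=lambda t: (-t[0], t[1]))
        neighbors.append({j for _, j in sims[:k]})
    return neighbors


def mutual_knn_alignment(feats_a, feats_b, k):
    """Mean fraction of k-nearest neighbors shared between two embeddings.

    feats_a[i] and feats_b[i] must describe the same underlying sample
    (for example, a caption and its image). Dimensions may differ.
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both models must embed the same paired samples")
    if not 1 <= k < len(feats_a):
        raise ValueError("k must satisfy 1 <= k < number of samples")
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)


# Toy, hand-made vectors (not real model features): two clusters of two samples.
text_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
image_feats = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]
shuffled_feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

same_structure = mutual_knn_alignment(text_feats, image_feats, k=1)
different_structure = mutual_knn_alignment(text_feats, shuffled_feats, k=1)
```

With these toy inputs, the two spaces that share neighborhood structure score 1.0 and the shuffled pairing scores 0.0. The score depends only on neighbor identity within each space, so the two models do not need matching feature dimensions.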

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Hypotheses (2)

hypothesis

Different neural network models trained on different objectives and modalities are converging to a shared statistical model of reality in their representation spaces
associated_with · supports
The central hypothesis of the paper; the platonic representation hypothesis itself
Scaling model size, as well as data and task diversity, drives representational convergence toward the platonic representation
supports
Core mechanism hypothesis connecting PRH to the empirical trend of scaling in AI

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Better LLMs (measured by 1 − bits-per-byte on OpenWebText) show a linear relationship with alignment to vision models measured via mutual nearest-neighbor on WIT · finding · 0.853
Key cross-modal alignment result
Training on image data should improve LLM performance, and training on language data should improve vision model performance · hypothesis · 0.843
Implication of PRH for cross-modal training efficiency
LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM · finding · 0.837
Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure
Auditory models are roughly aligned with LLMs up to a linear transformation · finding · 0.826
Ngo & Kim result extending cross-modal convergence to the auditory domain
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on the "likely" dataset performing poorly on anti-correlated datasets · claim · 0.820
Establishes that the observed linear structure is not merely a representation of text probability
Alignment with vision models corresponds to improved performance on downstream language tasks including commonsense reasoning and math · claim · 0.819
Claims that alignment score is a proxy for general capability
LLMs trained only on language data have rich knowledge of visual structures sufficient to train decent visual representations · claim · 0.817
Supporting evidence for cross-modal platonic representation
LLM alignment to DINOv2 vision model shows a linear relationship with HellaSwag (commonsense reasoning) performance · finding · 0.815
Supports claim that cross-modal alignment predicts downstream language task performance