claim

active

claim:llms-trained-only-on-language-data-have-rich-knowledge-of-visual-structures-sufficient-to-train-decent-visual-representations

LLMs trained only on language data have rich knowledge of visual structures sufficient to train decent visual representations

Supporting evidence for cross-modal platonic representation

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Findings (1)

finding

LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM
restates
Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure

Hypotheses (1)

hypothesis

Training on image data should improve LLM performance, and training on language data should improve vision model performance
supports
Implication of PRH for cross-modal training efficiency

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Connecting the Dots: LLMs Can Infer and Verbalize Latent Structure from Disparate Training Data (Treutlein et al. 2024) · concept · 0.848
Out-of-context reasoning work directly related to synthetic document fine-tuning experiments
Understanding how LMs learn linguistic behaviours may offer insights into fundamental properties of language · hypothesis · 0.819
Forward-looking hypothesis linking LM mechanism analysis to linguistic theory
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: a linear relationship between language modeling score and vision-language alignment · finding · 0.817
Core cross-modal empirical result: larger and better language models align better with vision models
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on likely performing poorly on anti-correlated datasets · claim · 0.817
Establishes that the observed linear structure is not merely a representation of text probability
LLMs hierarchically develop understanding of their input data, progressing from surface-level features in early layers to more abstract concepts in later layers · claim · 0.797
Interpretation of the layer-by-layer PCA visualizations showing linear structure emerging in early-middle layers
LLMs can predict their own responses more accurately than external observers, implying privileged internal knowledge · finding · 0.793
Binder et al. finding cited as evidence that LLMs possess introspective capacity analogous to mindfulness
Code agents operate on structured data with exact arithmetic, while LLMs must parse natural-language observations and track state across turns; some failures may partly reflect numerical parsing or working-memory limitations · claim · 0.783
Discussion of potential confounds
LLM representations exhibit intriguing patterns under spatio-permutational analyses, suggesting a potentially profound yet tentative indication of consciousness · claim · 0.783
Qualified positive claim from the spatio-permutational analysis, in which two cases satisfy all three criteria.

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

finding
LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM