finding

active

finding:llms-trained-only-on-language-data-have-rich-enough-knowledge-of-visual-structures-that-decent-visual-representations-can-be-trained-on-images-generated-solely-by-querying-the-llm

LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM

Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Claims (1)

claim

LLMs trained only on language data have rich knowledge of visual structures sufficient to train decent visual representations
restates
Supporting evidence for cross-modal platonic representation

Hypotheses (1)

hypothesis

Language models would achieve some notion of grounding in the visual domain even in the absence of cross-modal training data, because they share a common modality-agnostic representation
supports
Implication of PRH for language model visual grounding

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Training on image data should improve LLM performance, and training on language data should improve vision model performance · hypothesis · 0.872
Implication of PRH for cross-modal training efficiency
Connecting the Dots: LLMs Can Infer and Verbalize Latent Structure from Disparate Training Data (Treutlein et al. 2024) · concept · 0.844
Out-of-context reasoning work directly related to synthetic document fine-tuning experiments
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: linear relationship between language modeling score and vision-language alignment · finding · 0.837
Core cross-modal empirical result: larger and better language models align better with vision models
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on likely performing poorly on anti-correlated datasets · claim · 0.833
Establishes that the observed linear structure is not merely a representation of text probability
Understanding how LMs learn linguistic behaviours may offer insights into fundamental properties of language · hypothesis · 0.817
Forward-looking hypothesis linking LM mechanism analysis to linguistic theory
LLMs hierarchically develop understanding of their input data, progressing from surface-level features in early layers to more abstract concepts in later layers · claim · 0.807
Interpretation of the layer-by-layer PCA visualizations showing linear structure emerging in early-middle layers
LLM representations exhibit intriguing patterns under spatio-permutational analyses, suggesting a potentially profound yet tentative indication of consciousness · claim · 0.806
Qualified positive claim from the spatio-permutational analysis, where two cases satisfy all three criteria
Code agents operate on structured data with exact arithmetic, while LLMs must parse natural-language observations and track state across turns; some failures may partly reflect numerical parsing or working-memory limitations · claim · 0.801
Discussion of potential confounds
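
A list like the one above (cosine ≥ 0.65, no typed edge) could be produced roughly as follows. This is a minimal sketch, not the page's actual implementation: the function name `related_by_similarity`, the entity IDs, the toy 2-D embeddings, and the representation of typed edges as a set of ID pairs are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, cosine) pairs at or above `threshold` that have
    no typed edge to `target`, sorted by descending similarity."""
    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target:
            continue
        # Skip entities already linked to the target in either direction.
        if (target, entity_id) in typed_edges or (entity_id, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((entity_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: IDs and vectors are made up for this sketch.
embeddings = {
    "finding": [1.0, 0.0],
    "claim_linked": [1.0, 0.1],     # similar, but already has a typed edge
    "hypothesis_near": [0.9, 0.4],  # similar, no typed edge
    "claim_far": [0.0, 1.0],        # dissimilar
}
typed_edges = {("claim_linked", "finding")}

related = related_by_similarity("finding", embeddings, typed_edges)
```

With this toy data only `hypothesis_near` is returned: `claim_linked` is excluded because it already has a typed edge, and `claim_far` falls below the threshold.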

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

claim
LLMs trained only on language data have rich knowledge of visual structures sufficient to train decent visual representations