hypothesis

active

hypothesis:language-models-would-achieve-some-notion-of-grounding-in-the-visual-domain-even-in-the-absence-of-cross-modal-training-data-because-they-share-a-common-modality-agnostic-representation

Language models would achieve some notion of grounding in the visual domain even in the absence of cross-modal training data, because they share a common modality-agnostic representation

Implication of PRH for language model visual grounding

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Findings (1)

finding

LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM
supports
Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure

Hypotheses (1)

hypothesis

Different neural network models trained on different objectives and modalities are converging to a shared statistical model of reality in their representation spaces
associated_with
The central hypothesis of the paper; the platonic representation hypothesis itself

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
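The selection rule described above (cosine at or above 0.65, no typed edge to this entity) can be sketched as a simple filter. This is a minimal illustration, not the system's actual implementation: the function names, the toy two-dimensional embeddings, and the entity labels are all hypothetical, and only the 0.65 threshold comes from this page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_neighbors, threshold=0.65):
    """Return (name, score) pairs at or above the threshold, best first.

    Entities that already have a typed edge to the target are skipped,
    since they are listed elsewhere in the neighborhood.
    """
    scored = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real entity embeddings would be much larger.
embeddings = {
    "grounding": [1.0, 0.0],
    "joint-training": [0.8, 0.6],  # similar, no typed edge: a candidate
    "prh": [0.9, 0.1],             # similar, but already has a typed edge
    "unrelated": [0.0, 1.0],       # below the threshold
}
typed_neighbors = {"prh"}

candidates = related_by_similarity("grounding", embeddings, typed_neighbors)
for name, score in candidates:
    print(f"{name} · {score:.3f}")
```

Excluding typed neighbors before scoring keeps the list focused on entities that are candidates for new edges or unrecognized duplicates, which is the stated purpose of this section.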

If there is a modality-agnostic platonic representation, training on both image and language data should improve the best model in either modality · claim · 0.816
Implication of PRH for training practice: both modalities point at the same underlying reality
Different models cannot converge to the same representation if they have access to fundamentally different information; convergence is capped by mutual information between input signals · claim · 0.784
Key limitation of the PRH for non-bijective observations
It's tricky, because for a typical language model the entity is sort of tricameral: the base simulator, the simulated simulator, and the simulated awareness. · quote · 0.775
Antra's earlier definitive statement of the tricameral model.
Alignment with vision models corresponds to improved performance on downstream language tasks including commonsense reasoning and math · claim · 0.771
Claims that alignment score is a proxy for general capability
Language models prefer reusing generic arithmetic mechanisms over learning task-specific modular computations even when task-specific geometry exists · claim · 0.770
Broader interpretive claim about LM learning bias inferred from the findings
Connectionist models can exhibit learning bottom-up, without centralised control or an external teacher, and without any performance feedback applied at the system level. · claim · 0.769
Key property of distributed unsupervised learning.
The mathematical argument for cross-modal convergence strictly holds only for bijective projections of the underlying world · claim · 0.766
Key limitation of the formal PRH derivation: lossy or stochastic observation functions weaken the convergence guarantee
Imagine reading a textbook with no figures or tables. Our ability to knowledge acquisition is greatly strengthened by jointly modeling diverse data modalities, such as vision, language, and audio. · quote · 0.766
Load-bearing motivation for multimodal approach; frames the cognitive advantage of joint modalities.