finding

active

finding:a-single-linear-projection-is-sufficient-to-stitch-a-vision-model-to-an-llm-and-achieve-good-performance-on-visual-question-answering-and-image-captioning

A single linear projection is sufficient to stitch a vision model to an LLM and achieve good performance on visual question answering and image captioning

Merullo et al. result on cross-modal representational compatibility
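
To make the finding concrete, here is a minimal sketch of what "stitching" with one linear projection can look like. It assumes a setup in which the vision encoder and the LLM both stay frozen and only the projection is trained. The function names, dimensions, and random stand-in features are all hypothetical and are not taken from Merullo et al. or from the source paper.

```python
import numpy as np

def project_image_features(image_feats, W, b):
    """Map vision features (n_patches, d_vision) into the LLM's
    token-embedding space (n_patches, d_llm) with a single affine map."""
    return image_feats @ W + b

def build_llm_input(image_feats, text_embeds, W, b):
    """Prepend the projected image features, used as soft tokens,
    to the text token embeddings."""
    soft_tokens = project_image_features(image_feats, W, b)
    return np.concatenate([soft_tokens, text_embeds], axis=0)

# Hypothetical sizes; random arrays stand in for real encoder outputs.
rng = np.random.default_rng(0)
d_vision, d_llm = 64, 128
image_feats = rng.normal(size=(49, d_vision))   # e.g. a 7x7 grid of patch features
text_embeds = rng.normal(size=(12, d_llm))      # embeddings of 12 prompt tokens

# The only trainable parameters in this sketch: one weight matrix and one bias.
W = rng.normal(scale=0.02, size=(d_vision, d_llm))
b = np.zeros(d_llm)

llm_input = build_llm_input(image_feats, text_embeds, W, b)
print(llm_input.shape)
```

In such a setup the sequence `llm_input` would be fed to the language model in place of ordinary token embeddings, and `W` and `b` would be fit on a captioning objective. The sketch omits training entirely.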

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Claims (1)

claim

There is a growing similarity in how datapoints are represented in different neural network models, spanning different architectures, training objectives, and data modalities
supports
Primary empirical claim of the paper

Hypotheses (1)

hypothesis

Different neural network models trained on different objectives and modalities are converging to a shared statistical model of reality in their representation spaces
supports
The central hypothesis of the paper; the platonic representation hypothesis itself

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: language modeling score and vision-language alignment are linearly related · finding · 0.812
Core cross-modal empirical result: larger and better language models align better with vision models
Training on image data should improve LLM performance, and training on language data should improve vision model performance · hypothesis · 0.782
Implication of PRH for cross-modal training efficiency
Better LLMs (measured by 1 − bits-per-byte on OpenWebText) show a linear relationship with alignment to vision models, measured via mutual nearest-neighbor on WIT · finding · 0.780
Key cross-modal alignment result
LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM · finding · 0.777
Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on the "likely" dataset performing poorly on anti-correlated datasets · claim · 0.777
Establishes that the observed linear structure is not merely a representation of text probability
LLM representations exhibit intriguing patterns under spatio-permutational analyses, suggesting a potentially profound yet tentative indication of consciousness. · claim · 0.767
Qualified positive claim from spatio permutation analysis where two cases satisfy all three criteria.
The more descriptive (higher information) a caption is, the better its LLM representation aligns with the visual representation of the corresponding image · claim · 0.766
Preliminary test of the information-level limitation of PRH; denser captions = higher cross-modal alignment
LLM alignment to DINOv2 vision model shows a linear relationship with HellaSwag (commonsense reasoning) performance · finding · 0.763
Supports claim that cross-modal alignment predicts downstream language task performance
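
Several of the related findings above measure cross-modal alignment with a mutual nearest-neighbor metric. The following is a rough sketch of one way such a score can be computed for paired samples (row i of each matrix describes the same datapoint, for example an image and its caption). It assumes cosine similarity as the neighbor criterion and a hypothetical default of k = 10. It is an illustration of the idea, not the source paper's reference implementation.

```python
import numpy as np

def knn_indices(feats, k):
    """Indices of the k nearest neighbors of each row under cosine
    similarity, excluding the row itself."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)          # a point is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Average fraction of k-nearest neighbors shared between two
    representation spaces, computed over paired samples."""
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlap = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap)) / k

# Random stand-ins for vision and language features of 200 paired samples.
rng = np.random.default_rng(0)
vision_feats = rng.normal(size=(200, 32))
text_feats = rng.normal(size=(200, 16))

print(mutual_knn_alignment(vision_feats, vision_feats))  # identical spaces
print(mutual_knn_alignment(vision_feats, text_feats))    # unrelated spaces
```

The score lies between 0 and 1. Identical neighborhood structure gives the maximum, while unrelated random features give a value near k / (n − 1), where n is the number of samples.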