finding

active

finding:llm-alignment-to-dinov2-vision-model-shows-a-linear-relationship-with-hellaswag-commonsense-reasoning-performance

LLM alignment to DINOv2 vision model shows a linear relationship with HellaSwag (commonsense reasoning) performance

Supports claim that cross-modal alignment predicts downstream language task performance

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Claims (1)

claim

Alignment with vision models corresponds to improved performance on downstream language tasks including commonsense reasoning and math
supports
Claims that alignment score is a proxy for general capability

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLM alignment score to DINOv2 shows an emergence-esque trend with GSM8K mathematical reasoning performance · finding · 0.842
Alignment predicts math performance with emergent pattern
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: a linear relationship between language modeling score and vision-language alignment · finding · 0.815
Core cross-modal empirical result: larger and better language models align better with vision models
Better LLMs (measured by 1 − bits-per-byte on OpenWebText) show a linear relationship with alignment to vision models, measured via mutual nearest-neighbor on WIT · finding · 0.805
Key cross-modal alignment result
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on likely performing poorly on anti-correlated datasets · claim · 0.774
Establishes that the observed linear structure is not merely a representation of text probability
A single linear projection is sufficient to stitch a vision model to an LLM and achieve good performance on visual question answering and image captioning · finding · 0.763
Merullo et al. result on cross-modal representational compatibility
Auditory models are roughly aligned with LLMs up to a linear transformation · finding · 0.759
Ngo & Kim result extending cross-modal convergence to the auditory domain
In intermediate regimes of scale or layer depth, LLMs may linearly represent features at intermediate levels of abstraction such as 'accurate factual recall' or 'close association' rather than abstract truth · claim · 0.754
Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations
Over 80% IIA achieved using complex non-linear alignment maps on randomly initialised MLPs in hierarchical equality task · finding · 0.751
Demonstrates that high IIA can be obtained even when model cannot solve the task
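
Several entries above refer to an alignment score measured via mutual nearest-neighbor. As a rough illustration of that kind of metric, the sketch below scores two sets of features extracted from the same paired samples by how much their k-nearest-neighbor sets overlap. It is a minimal sketch, not the source paper's implementation: the function names `knn_indices` and `mutual_knn_alignment`, the use of cosine similarity, and the default `k=10` are all assumptions made for illustration.

```python
import numpy as np


def knn_indices(feats: np.ndarray, k: int) -> np.ndarray:
    """Indices of each row's k nearest neighbors by cosine similarity, excluding itself.

    Assumes no row of `feats` is all zeros (hypothetical helper, not from the source).
    """
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]


def mutual_knn_alignment(feats_a: np.ndarray, feats_b: np.ndarray, k: int = 10) -> float:
    """Mean fraction of shared k-nearest neighbors between two representations.

    Row i of `feats_a` and row i of `feats_b` must describe the same underlying
    sample (for example, an image and its caption). Returns a value in [0, 1].
    """
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))
```

Under these assumptions, a representation compared with itself scores 1.0, and unrelated features score close to chance. The linear relationships reported in the entries above would then be checked by plotting this score for each LLM against its benchmark result.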