finding

active

finding:better-llms-measured-by-1-bits-per-byte-on-openwebtext-show-a-linear-relationship-with-alignment-to-vision-models-measured-via-mutual-nearest-neighbor-on-wit

Better LLMs (measured by 1 − bits-per-byte on OpenWebText) show a linear relationship with alignment to vision models, measured via mutual nearest-neighbor on WIT

Key cross-modal alignment result
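The two quantities in this finding can be illustrated with a minimal sketch. It is not the paper's released code. It assumes cosine similarity for the neighbor search, a neighborhood size `k` chosen by the caller, and a summed negative log-likelihood reported in nats. The function names `knn_indices`, `mutual_knn_alignment`, and `one_minus_bits_per_byte` are hypothetical.

```python
import numpy as np

def knn_indices(feats, k):
    """Indices of the k nearest neighbors of each row under cosine similarity.

    Assumes no row is all zeros and that k < number of rows.
    """
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Mean fraction of shared k-nearest neighbors between two feature spaces.

    Row i of feats_a and row i of feats_b must describe the same paired
    datapoint (for example, a caption and its image).
    """
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap))

def one_minus_bits_per_byte(total_nll_nats, total_bytes):
    """Language modeling score: 1 minus bits-per-byte.

    Assumes total_nll_nats is the summed negative log-likelihood in nats
    over text that occupies total_bytes bytes.
    """
    bits_per_byte = total_nll_nats / (np.log(2) * total_bytes)
    return 1.0 - bits_per_byte
```

Under these assumptions, two models with identical neighborhood structure score 1.0 and unrelated feature spaces score near `k / (n - 1)` for `n` samples. The finding plots the alignment score against the language modeling score across LLMs.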

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Claims (1)

claim

There is a growing similarity in how datapoints are represented in different neural network models, spanning different architectures, training objectives, and data modalities
supports
Primary empirical claim of the paper

Quotes (1)

quote

Neural networks, trained with different objectives on different data and modalities, are converging to a shared statistical model of reality in their representation spaces.
associated_with
The paper's central thesis statement, presented prominently after the abstract

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: linear relationship between language modeling score and vision-language alignment · finding · 0.853
Core cross-modal empirical result: larger and better language models align better with vision models
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on likely performing poorly on anti-correlated datasets · claim · 0.819
Establishes that the observed linear structure is not merely a representation of text probability
Li et al. 2024: larger LLMs outperform smaller ones at distinguishing self-related from non-self-related properties on self-awareness benchmarks · finding · 0.806
Prior finding showing scale-dependent self-awareness, consistent with the scale effect observed in the paper's Experiment 1
LLM alignment to DINOv2 vision model shows a linear relationship with HellaSwag (commonsense reasoning) performance · finding · 0.805
Supports claim that cross-modal alignment predicts downstream language task performance
Among 78 vision models, those solving more VTAB tasks (higher transfer performance) show higher mutual nearest-neighbor alignment with each other · finding · 0.803
Key empirical finding establishing that representational alignment correlates with model competence
LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM · finding · 0.800
Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure
Connecting the Dots: LLMs Can Infer and Verbalize Latent Structure from Disparate Training Data (Treutlein et al. 2024) · concept · 0.798
Out-of-context reasoning work directly related to synthetic document fine-tuning experiments
LLMs can predict their own responses more accurately than external observers, implying privileged internal knowledge · finding · 0.795
Binder et al. finding cited as evidence that LLMs possess introspective capacity analogous to mindfulness