paper
active
2024
25
paper:doi-10-48550-arxiv-2405-07987

The Platonic Representation Hypothesis

TL;DR

The paper argues that neural networks trained on different data modalities, architectures, and objectives are converging toward a shared statistical model of reality. It calls this the "platonic representation" and formalizes it as the pointwise mutual information (PMI) kernel over co-occurring events in the world. Convergence is measured with a mutual nearest-neighbor alignment metric. Across 78 vision models evaluated on the 19-task VTAB benchmark, models that solve more downstream tasks cluster tightly together while weaker models scatter; the top-performing bin is markedly more internally aligned than the lowest. Cross-modal convergence is equally pronounced. Language models from the BLOOM (560M–7.1B parameters), OpenLLaMA (3B–13B), and LLaMA (7B–65B) families are paired against DINOv2, MAE, CLIP, and ImageNet-21K ViTs on the Wikipedia Image-Text (WIT) dataset, and language modeling performance (1−bits-per-byte on OpenWebText) predicts vision-language alignment near-linearly. LLM alignment with DINOv2 in turn predicts Hellaswag commonsense accuracy linearly and GSM8K math accuracy in an emergence-like step. The paper names three selective pressures that drive convergence: the Multitask Scaling Hypothesis (more tasks shrink the feasible solution set), the Capacity Hypothesis (larger models more reliably reach shared optima), and the Simplicity Bias Hypothesis (deep networks implicitly favor low-complexity solutions). On this account, modality-agnostic representations are not an artifact of shared training recipes but an attractor determined by the statistical structure of reality itself. The downstream consequences the paper draws include cross-modal data interchangeability, reduced hallucination at scale, and the practical ease of linear stitching between modalities.
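The language modeling score quoted above, 1−bits-per-byte, can be illustrated with a minimal sketch. It assumes the score is derived from a model's summed negative log-likelihood (in nats) over a text and the UTF-8 byte length of that text. The function name `language_score` and its arguments are hypothetical, and the paper's exact evaluation pipeline (tokenization, averaging over documents) is not reproduced here.

```python
import math


def language_score(total_nll_nats: float, num_bytes: int) -> float:
    """Return 1 - bits-per-byte for a piece of text.

    total_nll_nats: summed negative log-likelihood (natural log) that the
        model assigns to the text (assumed input, not from the paper).
    num_bytes: length of the same text in UTF-8 bytes.
    """
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    # Convert nats to bits (divide by ln 2), then normalise by byte count.
    bits_per_byte = total_nll_nats / (math.log(2) * num_bytes)
    return 1.0 - bits_per_byte


# A model that spends exactly one bit per byte scores 0.
score_one_bit = language_score(total_nll_nats=100 * math.log(2), num_bytes=100)
```

Normalizing by bytes rather than tokens keeps the score comparable across model families with different tokenizers, which is what a comparison across BLOOM, OpenLLaMA, and LLaMA requires.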

What to take away

  1. Among 78 vision models evaluated on 19 VTAB tasks, representational alignment (mutual k-NN, k=10, measured on Places-365) increases monotonically with average transfer performance: the highest-performing quintile forms a tight cluster, while the lowest-performing quintile shows the most variable representations.
  2. Across the BLOOM (560M–7.1B), OpenLLaMA (3B–13B), and LLaMA (7B–65B) language model families, language modeling score (1−bits-per-byte on OpenWebText) predicts mutual nearest-neighbor alignment to DINOv2 vision features near-linearly when measured on 1,024 WIT image-caption pairs.
  3. LLM alignment to DINOv2 also predicts downstream task accuracy: linearly on Hellaswag (commonsense reasoning) and as an emergence-like step function on GSM8K (5-shot math), linking representational geometry to behavioral capability.
  4. The paper introduces the mutual k-nearest-neighbor alignment metric (m_NN), the mean size of the intersection of the k-NN sets induced by two representation kernels, normalized by k. It shows that m_NN is more sensitive to cross-modal alignment trends than CKA, which exhibits weak trends even within the same modality.
  5. A formal proof shows that contrastive learners using binary NCE or InfoNCE objectives are minimized by representations whose dot-product kernel equals the pointwise mutual information (PMI) kernel K_PMI over co-occurring observation pairs, grounding the platonic representation in a concrete mathematical object.
  6. Color representations learned from pixel co-occurrence statistics in CIFAR-10 images and from SimCSE/RoBERTa sentence embeddings of color words both recover approximately the same perceptual organization as CIELAB color space via multidimensional scaling, providing a direct empirical test of cross-modal PMI convergence.
  7. Using LLaMA3-8B-Instruct to generate summaries of Densely-Captioned-Images (DCI) captions at 5, 10, 20, and 30 words, alignment between language and vision models increases monotonically with caption density, consistent with the prediction that higher mutual information between modalities produces stronger representational convergence.
  8. CLIP models fine-tuned on ImageNet-12K classification (CLIP I12K ft) show lower cross-modal alignment with language models than their pre-fine-tuning counterparts, suggesting that narrowing the task after pretraining reduces representational generality.
  9. The paper explicitly leaves open whether the achieved cross-modal mutual nearest-neighbor alignment score of approximately 0.16 (out of a theoretical maximum of 1.0) represents near-complete alignment with residual noise or genuinely poor alignment with substantial structure yet to be explained.
  10. The Multitask Scaling Hypothesis predicts that the set of representations competent for N tasks is smaller than the set competent for M < N tasks, which suggests that training on maximally diverse objectives, and not scale alone, drives convergence to the platonic representation.
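The mutual k-NN metric described in item 4 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes both models are evaluated on the same n samples (row i of each feature matrix refers to the same input) and that neighbors are ranked by cosine similarity; the function and variable names are hypothetical.

```python
import numpy as np


def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Mean overlap of k-NN sets from two representations, divided by k.

    feats_a: array of shape (n, d_a), features from model A.
    feats_b: array of shape (n, d_b), features from model B for the
        same n samples in the same order.
    Returns a value in [0, 1]; 1 means identical neighborhoods.
    """
    feats_a = np.asarray(feats_a, dtype=float)
    feats_b = np.asarray(feats_b, dtype=float)
    n = feats_a.shape[0]
    if feats_b.shape[0] != n:
        raise ValueError("both feature matrices must cover the same samples")
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")

    def knn_indices(feats):
        # Assumption: cosine similarity as the kernel (L2-normalised dot product).
        norms = np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-12)
        sim = (feats / norms) @ (feats / norms).T
        np.fill_diagonal(sim, -np.inf)  # a sample is not its own neighbour
        return np.argsort(-sim, axis=1)[:, :k]

    nn_a = knn_indices(feats_a)
    nn_b = knn_indices(feats_b)
    # Per-sample size of the intersection of the two neighbour sets.
    overlaps = [np.intersect1d(a, b).size for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps)) / k


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 16))
same = mutual_knn_alignment(x, x, k=10)                              # identical spaces
unrelated = mutual_knn_alignment(x, rng.normal(size=(200, 32)), k=10)  # independent noise
```

Because the metric depends only on neighbor rankings, the two feature matrices may have different dimensionality, which is what makes it usable across vision and language models. For unrelated representations the expected score is roughly k/(n−1), a useful baseline when reading a value such as the 0.16 in item 9.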

Peer brief — for seminar discussion

The paper proposes and empirically supports the Platonic Representation Hypothesis: that neural networks trained on different data, objectives, and modalities are converging toward a single shared representation of reality, formalized as the pointwise mutual information (PMI) kernel over co-occurring world events. To measure convergence, the paper introduces the mutual k-nearest-neighbor alignment metric (m_NN), which computes the mean intersection of the k-NN sets induced by two representation kernels, normalized by k. The authors chose it over CKA because CKA showed weak or noisy trends even in within-modality comparisons. An alternative would have been model stitching performance, which Bansal et al. (2021) showed captures aspects of alignment invisible to CKA, though it is computationally heavier and harder to extend to cross-modal settings.
The load-bearing empirical finding is twofold. First, among 78 vision models measured on Places-365, alignment increases monotonically with VTAB transfer performance across 19 tasks, with high-performing models clustering tightly and low-performing models scattering, consistent with the Anna Karenina framing borrowed from Bansal et al. Second, across BLOOM (560M–7.1B), OpenLLaMA (3B–13B), and LLaMA (7B–65B) paired against DINOv2, MAE, CLIP, and ImageNet-21K ViTs on 1,024 WIT image-caption samples, language modeling performance (1−bits-per-byte on OpenWebText) predicts language-vision alignment near-linearly, and that alignment in turn predicts Hellaswag accuracy linearly and GSM8K accuracy in an emergence-like step. Theoretical support comes from a proof that binary NCE and InfoNCE contrastive objectives are minimized by representations whose kernel equals K_PMI, and that under sufficient world smoothness K_PMI is exactly expressible as inner products of learned features. Three hypotheses are named as selective pressures: the Multitask Scaling Hypothesis, the Capacity Hypothesis, and the Simplicity Bias Hypothesis.
The implications are substantial. If modalities share a PMI attractor, then image and language data should be interchangeable for improving either modality's model, cross-modal linear stitching should be cheap, and hallucination and bias should decrease with scale as representations better reflect reality's statistics. The paper also predicts that fine-tuning toward narrow tasks degrades platonic convergence, as illustrated by CLIP fine-tuned on ImageNet-12K showing reduced language alignment. The most contestable element is the interpretability of the alignment metric: the cross-modal case reports a mutual nearest-neighbor score of 0.16 against a theoretical maximum of 1.0, and the paper explicitly leaves open whether this reflects near-complete alignment corrupted by noise or genuinely weak alignment with most structure unexplained. A critical reader would also push back on the causal claim that convergence is toward a representation of reality rather than toward a representation of the particular internet-scale data distribution used to train these models. The two are conflated by the bijection assumption in the mathematical framework, which the paper itself acknowledges breaks down for lossy or stochastic observations. The restriction to vision and language, with robotics and other embodied modalities noted as lagging, further limits the generality of the current evidence.
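The PMI kernel at the center of the brief is K_PMI(x_a, x_b) = log [ P(x_a, x_b) / (P(x_a) P(x_b)) ], where P(x_a, x_b) is the probability that two observations co-occur. A minimal sketch of estimating it from a table of co-occurrence counts follows. The count matrix, the `smoothing` parameter, and the function name are illustrative assumptions; the paper's theoretical result concerns what contrastive learners converge to, not this direct estimator.

```python
import numpy as np


def pmi_kernel(cooc_counts, smoothing=0.0):
    """Estimate the PMI kernel from a square co-occurrence count matrix.

    cooc_counts[i, j]: how often events i and j were observed together
        (hypothetical input, e.g. counts within a fixed window).
    smoothing: additive constant applied to every cell to avoid log(0).
    """
    counts = np.asarray(cooc_counts, dtype=float) + smoothing
    if counts.ndim != 2 or counts.shape[0] != counts.shape[1]:
        raise ValueError("cooc_counts must be a square matrix")
    # Co-occurrence is treated as unordered, so symmetrise the counts.
    counts = (counts + counts.T) / 2.0
    if np.any(counts <= 0):
        raise ValueError("all smoothed counts must be positive")
    joint = counts / counts.sum()    # estimate of P(x_a, x_b)
    marginal = joint.sum(axis=1)     # estimate of P(x_a)
    return np.log(joint) - np.log(np.outer(marginal, marginal))


# Events 0 and 1 co-occur with themselves more often than with each other.
k_pmi = pmi_kernel([[4, 1], [1, 4]])
```

Positive entries mark pairs that co-occur more often than independence would predict, negative entries mark pairs that co-occur less often, and a table of independent events yields an all-zero kernel.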

Original abstract

We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato's concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.

Cited by (1)

  • Model Alignment Search

    Model Alignment Search (MAS) establishes bidirectional causal similarity between neural networks by learning a per-model orthogonal rotation matrix that isolates behaviorally relevant subspaces and us