paper:doi-10-48550-arxiv-2405-07987
The Platonic Representation Hypothesis
TL;DR
Neural networks trained on different data modalities, architectures, and objectives are converging toward a shared statistical model of reality — what the paper terms the "platonic representation" — formalized as the pointwise mutual information (PMI) kernel over co-occurring events in the world. Measured via a mutual nearest-neighbor alignment metric across 78 vision models evaluated on the 19-task VTAB benchmark, models that solve more downstream tasks cluster tightly together while weaker models scatter; the top-performing bin is markedly more internally aligned than the lowest. Cross-modal convergence is equally pronounced: across families spanning BLOOM (560M–7.1B parameters), OpenLLaMA (3B–13B), and LLaMA (7B–65B) paired against DINOv2, MAE, CLIP, and ImageNet-21K ViTs measured on the Wikipedia Image-Text (WIT) dataset, language modeling performance (1−bits-per-byte on OpenWebText) predicts vision-language alignment with a near-linear relationship, and LLM alignment with DINOv2 predicts Hellaswag commonsense accuracy linearly and GSM8K math accuracy in an emergence-like step. Three selective pressures drive convergence: the Multitask Scaling Hypothesis (more tasks shrink the feasible solution set), the Capacity Hypothesis (larger models more reliably reach shared optima), and the Simplicity Bias Hypothesis (deep networks implicitly favor low-complexity solutions). The paper argues this implies that modality-agnostic representations are not an artifact of shared training recipes but an attractor determined by the statistical structure of reality itself, with downstream consequences including cross-modal data interchangeability, reduced hallucination at scale, and the practical ease of linear stitching between modalities.
What to take away
- 1. Among 78 vision models evaluated on 19 VTAB tasks, representational alignment (mutual k-NN, k=10, measured on Places-365) increases monotonically with average transfer performance, such that the highest-performing quintile forms a tight cluster while the lowest-performing quintile shows maximally variable representations.
- 2. Across BLOOM (560M–7.1B), OpenLLaMA (3B–13B), and LLaMA (7B–65B) language model families, language modeling score (1−bits-per-byte on OpenWebText) predicts mutual nearest-neighbor alignment to DINOv2 vision features with a near-linear relationship when measured on 1,024 WIT image-caption pairs.
- 3. LLM alignment score to DINOv2 also predicts downstream task accuracy on Hellaswag (commonsense reasoning) with a linear trend, and on GSM8K (5-shot math) with an emergence-like step function, linking representational geometry to behavioral capability.
- 4. The paper introduces the mutual k-nearest-neighbor alignment metric (m_NN), measuring the mean intersection of k-NN sets induced by two representation kernels normalized by k, and shows it is more sensitive to cross-modal alignment trends than CKA, which exhibits weak trends even within the same modality.
- 5. A formal proof shows that the binary NCE and InfoNCE contrastive objectives are minimized by representations whose dot-product kernel equals the pointwise mutual information (PMI) kernel K_PMI over co-occurring observation pairs, grounding the platonic representation in a concrete mathematical object.
- 6. Color representations learned from pixel co-occurrence statistics in CIFAR-10 images and from SimCSE/RoBERTa sentence embeddings of color words both recover approximately the same perceptual organization as CIELAB color space via multidimensional scaling, providing a direct empirical test of cross-modal PMI convergence.
- 7. Using LLaMA3-8B-Instruct to generate summaries of Densely-Captioned-Images (DCI) captions at 5, 10, 20, and 30 words, alignment between language and vision models increases monotonically with caption density, consistent with the prediction that higher mutual information between modalities produces stronger representational convergence.
- 8. CLIP models fine-tuned on ImageNet-12K classification (CLIP I12K ft) show lower cross-modal alignment with language models than their pre-fine-tuning counterparts, demonstrating that task-narrowing after pretraining actively reduces representational generality.
- 9. An open question the paper explicitly leaves unresolved is whether the achieved cross-modal mutual nearest-neighbor alignment score of approximately 0.16 (out of a theoretical maximum of 1.0) represents near-complete alignment with residual noise or genuinely poor alignment with substantial structure yet to be explained.
- 10. The Multitask Scaling Hypothesis holds that fewer representations are competent for N tasks than for M < N tasks, implying that training on diverse objectives, not scale alone, is needed for convergence to the platonic representation.
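The mutual k-NN metric in item 4 can be sketched in a few lines. This is a minimal illustration, not the paper's code: the use of cosine similarity as the kernel, the function names `knn_indices` and `mutual_knn_alignment`, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def knn_indices(feats, k):
    """Indices of each row's k nearest neighbors under cosine similarity (self excluded)."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Mean size of the intersection of the two k-NN sets, normalized by k."""
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps)) / k

# Synthetic check: a rotated copy keeps the same neighbors, unrelated noise does not.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 16))
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
score_same = mutual_knn_alignment(x, x @ rotation, k=10)
score_rand = mutual_knn_alignment(x, rng.normal(size=(200, 32)), k=10)
```

Row i of both feature matrices must describe the same underlying sample (for example, an image and its caption). The two feature dimensions may differ, since only neighbor identities are compared.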
Peer brief — for seminar discussion
The paper proposes and empirically supports the Platonic Representation Hypothesis: that neural networks trained on different data, objectives, and modalities are converging toward a single shared representation of reality, formalized as the pointwise mutual information (PMI) kernel over co-occurring world events. To measure convergence, the paper introduces the mutual k-nearest-neighbor alignment metric (m_NN), which computes the mean intersection of k-NN sets from two representation kernels; this was chosen over CKA because CKA showed weak or noisy trends even in within-modality comparisons. An alternative the authors could have used is model stitching performance, which Bansal et al. (2021) demonstrated captures aspects of alignment invisible to CKA, though it is computationally heavier and harder to extend to cross-modal settings. The load-bearing empirical finding is twofold. First, among 78 vision models measured on Places-365, alignment increases monotonically with VTAB transfer performance across 19 tasks, with high-performing models clustering tightly and low-performing models scattering — consistent with the Anna Karenina framing borrowed from Bansal et al. Second, across BLOOM (560M–7.1B), OpenLLaMA (3B–13B), and LLaMA (7B–65B) paired against DINOv2, MAE, CLIP, and ImageNet-21K ViTs on 1,024 WIT image-caption samples, language modeling performance (1−bits-per-byte on OpenWebText) predicts language-vision alignment with a near-linear relationship — and that alignment in turn predicts Hellaswag accuracy linearly and GSM8K accuracy in an emergence-like step. Theoretical support comes from proving that binary NCE and InfoNCE contrastive objectives are minimized by representations whose kernel equals K_PMI, and that under sufficient world smoothness K_PMI is exactly expressible as inner products of learned features. Three hypotheses are named as selective pressures: the Multitask Scaling Hypothesis, the Capacity Hypothesis, and the Simplicity Bias Hypothesis. 
The implications are substantial: if modalities share a PMI attractor, then image and language data should be interchangeable for improving either modality's model; cross-modal linear stitching should be cheap; and hallucination and bias should decrease with scale as representations better reflect reality's statistics. The paper also predicts that fine-tuning toward narrow tasks — demonstrated by CLIP fine-tuned on ImageNet-12K showing reduced language alignment — actively degrades platonic convergence. The most contestable element is the alignment metric's interpretability: a mutual nearest-neighbor score of 0.16 against a theoretical maximum of 1.0 is reported for the cross-modal case, and the paper explicitly leaves open whether this reflects near-complete alignment corrupted by noise or genuinely weak alignment with most structure unexplained. A critical reader would also push back on the causal claim that convergence is driven toward a representation of reality rather than toward a representation of the particular internet-scale data distribution used to train these models — the two are conflated by the bijection assumption in the mathematical framework, which the paper itself acknowledges breaks down for lossy or stochastic observations. The restriction to vision and language, with robotics and other embodied modalities noted as lagging, further limits the generality of the current evidence.
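The language modeling score used above, 1−bits-per-byte, is simple arithmetic once a model's total negative log-likelihood on a text is known. The sketch below assumes the loss is summed over tokens and expressed in nats; the function names and the toy numbers are hypothetical, and the paper's actual evaluation pipeline is not described here.

```python
import math

def bits_per_byte(total_nll_nats, text):
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)

def language_modeling_score(total_nll_nats, text):
    """Score of the form 1 - bits-per-byte: higher means better compression of the text."""
    return 1.0 - bits_per_byte(total_nll_nats, text)

# Toy example: a model that spends half a bit per byte on an 11-byte string.
text = "hello world"
toy_nll = 0.5 * math.log(2) * len(text.encode("utf-8"))
score = language_modeling_score(toy_nll, text)
```

Normalizing by bytes rather than tokens makes the score comparable across model families with different tokenizers, which matters when BLOOM, OpenLLaMA, and LLaMA are placed on one axis.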
Methods (8)
- Centered Kernel Alignment: Standard alignment metric cited and compared against; measures global kernel similarity between representations
- Centered Kernel Nearest-Neighbor Alignment: Modified CKA metric that restricts cross-covariance to nearest neighbors; introduced in this paper's appendix
- Cycle k-NN: Alternative alignment metric; measures whether nearest neighbor in one domain also considers original sample as nearest neighbor in other domain
- Edit Distance k-NN: Alternative alignment metric compared in appendix; computes edit distance between nearest neighbor lists
- Longest Common Subsequence k-NN: Alternative alignment metric compared in appendix; calculates longest common subsequence of nearest neighbor lists
- Model Stitching: Technique to measure representational compatibility by integrating intermediate representations of one model into another
- Mutual k-Nearest Neighbor Alignment Metric: Primary alignment metric used in experiments; measures mean intersection of k-nearest neighbor sets between two kernels
- Singular Vector Canonical Correlation Analysis: Alternative alignment metric compared in appendix experiments
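For contrast with the nearest-neighbor metrics, Centered Kernel Alignment compares the full kernels rather than local neighborhoods. The sketch below uses the linear-kernel form of CKA; the choice of a linear kernel, the function name `linear_cka`, and the synthetic data are assumptions for illustration and may differ from the variant used in the paper.

```python
import numpy as np

def linear_cka(feats_a, feats_b):
    """Linear CKA: ||B^T A||_F^2 / (||A^T A||_F * ||B^T B||_F) on column-centered features."""
    a = feats_a - feats_a.mean(axis=0, keepdims=True)
    b = feats_b - feats_b.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(b.T @ a, ord="fro") ** 2
    norm_a = np.linalg.norm(a.T @ a, ord="fro")
    norm_b = np.linalg.norm(b.T @ b, ord="fro")
    return float(cross / (norm_a * norm_b))

# Synthetic check: CKA ignores rotation and uniform scaling, but not unrelated noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 16))
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
noise = rng.normal(size=(200, 32))
cka_same = linear_cka(x, 3.0 * x @ rotation)
cka_rand = linear_cka(x, noise)
```

Because every pairwise similarity contributes to the score, CKA is a global measure, whereas the mutual k-NN metric only looks at each sample's closest neighbors.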
Frameworks (1)
- Idealized World Model (discrete events): Mathematical formalization of a world with T discrete events and bijective observation functions, used to prove PMI convergence
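For a small discrete world, the PMI kernel can be computed directly from a table of co-occurrence counts. The following is a minimal sketch under stated assumptions: the function name `pmi_kernel`, the symmetrization step, the `eps` smoothing term, and the toy count matrix are all illustrative choices, not details taken from the paper.

```python
import numpy as np

def pmi_kernel(cooccurrence_counts, eps=1e-12):
    """PMI(i, j) = log P(i, j) / (P(i) P(j)) from a square matrix of co-occurrence counts."""
    counts = np.asarray(cooccurrence_counts, dtype=float)
    counts = (counts + counts.T) / 2.0           # treat co-occurrence as unordered
    joint = counts / counts.sum()                # joint probability of each pair
    marginal = joint.sum(axis=1)                 # probability of each single event
    # eps avoids log(0) for pairs that never co-occur (an assumption of this sketch)
    return np.log((joint + eps) / (np.outer(marginal, marginal) + eps))

# Toy world with three events that mostly co-occur with themselves.
toy_counts = [[8, 1, 1],
              [1, 8, 1],
              [1, 1, 8]]
k_pmi = pmi_kernel(toy_counts)
```

Pairs that co-occur more often than chance get positive entries and pairs that co-occur less often get negative ones, which is the sense in which the kernel encodes similarity between events.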
Findings (25)
- PMI computed from color co-occurrences in CIFAR-10 images yields a perceptual color representation closely matching both CIELAB space and language model embeddings (SimCSE, RoBERTa)
Validates theoretical PMI convergence claim on real data
- The better an LLM is at language modeling, the more it aligns with vision models, and vice versa — linear relationship between language modeling score and vision-language alignment
Core cross-modal empirical result: larger and better language models align better with vision models
- A vision model trained on ImageNet can be aligned with a model trained on Places-365 while maintaining good performance, and early layers are more interchangeable than later layers
Lenc & Vedaldi result illustrating data independence in representations and layer-wise alignment
- LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM
Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure
- Spearman's rank correlation among different alignment metrics (CKA, SVCCA, Mutual k-NN, CKNNA) over 78 vision models is high across variants, with all p-values below 2.24×10^-105
Validates robustness of alignment metric choice
- CLIP models exhibit higher language-vision alignment than supervised or self-supervised vision models, but this alignment decreases after fine-tuning on ImageNet classification
CLIP training paradigm finding in cross-modal alignment
- Among 78 vision models on Places-365, models that solve more VTAB tasks tend to be more aligned with each other, with high-performance models forming a tightly clustered set
Empirical result showing alignment increases with model competence
- Rosetta Neurons — individual neurons activated by the same patterns across a range of diverse vision models form a common dictionary independently discovered by all models
Cited evidence that convergence extends to the neuron level, not just representational geometry
- Color distances learned from language co-occurrence statistics closely mirror those learned from image co-occurrence statistics and human perceptual distances (CIELAB)
Case study confirming that PMI-based learning in different modalities recovers the same perceptual representation
- Cross-modal language-vision alignment reaches a maximum of approximately 0.16 on mutual nearest-neighbor metric in Figure 3, well below the theoretical maximum of 1
Quantitative bound on observed alignment; raises the open question of whether this gap reflects noise or real misalignment
Claims (20)
- Different models cannot converge to the same representation if they have access to fundamentally different information; convergence is capped by mutual information between input signals
Key limitation of the PRH for non-bijective observations
- There is a growing similarity in how datapoints are represented in different neural network models, spanning different architectures, training objectives, and data modalities
Primary empirical claim of the paper
- Oriented Gabor-like filters are common in both artificial and biological vision systems, suggesting convergence to a similar initial representational layer
Early evidence of cross-system representational convergence
- The more descriptive (higher information) a caption is, the better its LLM representation aligns with the visual representation of the corresponding image
Preliminary test of the information-level limitation of PRH; denser captions = higher cross-modal alignment
- Scale is sufficient but not necessarily efficient to reach high levels of intelligence; different methods can scale with different efficiency levels
Implication of PRH for 'scale is all you need' argument
- If there is a modality-agnostic platonic representation, training on both image and language data should improve the best model in either modality
Implication of PRH for training practice: both modalities point at the same underlying reality
- Larger models should amplify bias less than smaller models, with model biases more accurately reflecting data biases rather than exacerbating them
Implication of PRH for AI fairness and bias
- Special-purpose intelligences optimized for narrow tasks may not converge; the PRH only holds for intelligences performing well on many tasks
Key limitation of PRH
- Zero-shot model stitching without a learned stitching layer is feasible because different text models embed data in remarkably similar ways
Strong evidence for representational alignment across models
- Conditional generation is easier than unconditional because the conditioning data shares the same platonic structure as the generated data
Implication of PRH for generative models
Hypotheses (13)
- Language models would achieve some notion of grounding in the visual domain even in the absence of cross-modal training data, because they share a common modality-agnostic representation
Implication of PRH for language model visual grounding
- Different neural network models trained on different objectives and modalities are converging to a shared statistical model of reality in their representation spaces
The central hypothesis of the paper; the platonic representation hypothesis itself
- Deep networks are biased toward finding simple fits to the data, and the bigger the model the stronger the bias, driving convergence to a smaller solution space
Selective pressure toward convergence via implicit regularization
- A family of contrastive learners converges to a representation whose kernel is the pointwise mutual information (PMI) of the underlying events
Mathematical formalization of what representation models converge to
- There are fewer representations competent for N tasks than M<N tasks, so training more general models should yield fewer possible solutions
Selective pressure toward convergence via task generality
- Scaling model size, as well as data and task diversity, drives representational convergence toward the platonic representation
Core mechanism hypothesis connecting PRH to the empirical trend of scaling in AI
- Training on image data should improve LLM performance, and training on language data should improve vision model performance
Implication of PRH for cross-modal training efficiency
- As models scale and converge toward an accurate model of reality, hallucinations should decrease with scale
Implication of PRH for LLM hallucination
- Higher information (denser) captions should yield higher language-vision alignment scores
Tests the information-level cap on cross-modal alignment
- Bigger models are more likely to converge to a shared representation than smaller models
Selective pressure toward convergence via model capacity
Questions (9)
- Is a mutual nearest-neighbor alignment score of 0.16 indicative of strong alignment with remaining gap being noise, or does it signify poor alignment with major differences left to explain?
Open question the authors leave unresolved about interpreting the magnitude of their alignment measurements
- Can language really describe the ineffable experience of watching a total solar eclipse, or how could an image convey a concept like 'I believe in the freedom of speech'?
Counterexample question about modality-specific information limits
- What is the appropriate metric for measuring representational alignment, given active debate on merits and deficiencies of all proposed measures?
Open methodological question acknowledged as limitation
- Research gap: representational convergence in robotics has not been demonstrated at the same level as vision and language
Authors note robotics lacks a standardized representation approach and sufficient training data diversity to show PRH effects
- Research gap: active debate on the merits and deficiencies of all current ways of measuring representational alignment
Authors acknowledge there is no settled best alignment metric, affecting the interpretation of all convergence findings
- Research gap: developing PRH for non-bijective, lossy, or stochastic observation functions and abstract concepts
The authors identify that their formal convergence proof requires bijective modality mappings, leaving a gap for more realistic settings
- What has led to representational convergence, will it continue, and ultimately where does it end?
Central motivating questions of the paper
- What exactly is the endpoint of representational convergence?
Motivates Section 4 where the PMI-kernel formalization is proposed
Original abstract
We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato's concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion (Brian Cheung, Evelina Fedorenko, Alex H. Williams, Eghbal A. Hosseini, 2026) ≈ 88%
- The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence? (Run Shao, Dongyue Wu, Jiajie Teng, Chao Tao, Jingdong Chen, Haifeng Li, Zhaoyang Zhang, 2026) ≈ 87%
- Semi-supervised Multimodal Representation Learning through a Global Workspace (Léopold Maytié, Rufin VanRullen, Benjamin Devillers, 2025) ≈ 86%
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? (in corpus, 2025) ≈ 85%
- Beyond Object-Level Alignment: Do Brains and DNNs Preserve the Same Transformations? (Yukiyasu Kamitani, 2026) ≈ 84%
- Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings (Thomas Fel, Victor Boutin, Agustin Picard, Grégoire Dhimoïla, 2026) ≈ 84%
- Model Alignment Search (in corpus, 2025) ≈ 84%
- Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions (Usman Naseem, 2026) ≈ 84%
- The Indra Representation Hypothesis for Multimodal Alignment (Hailing Wang, Kuo Yang, Yitian Zhang, Simon Jenni, Yun Fu, Jianglin Lu, 2026) ≈ 83%
- Decipher the Modality Gap in Multimodal Contrastive Learning: From Convergent Representations to Pairwise Alignment (Raphael Douady, Chao Chen, Lingjie Yi, 2025) ≈ 83%
- Interpreting Neural Networks through the Polytope Lens (Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, Connor Leahy, Sid Black, 2022) ≈ 83%
- The Umwelt Representation Hypothesis: Rethinking Universality (Rowan Sommers, Adrien Doerig, Tim C Kietzmann, Victoria Bosch, 2026) ≈ 83%
- Visual Representations inside the Language Model (Amita Kamath, Madeleine Grunde-McLaughlin, Winson Han, Ranjay Krishna, Benlin Liu, 2025) ≈ 83%
- Disentangling Polysemantic Neurons with a Null-Calibrated Polysemanticity Index and Causal Patch Interventions (Dhruv Kumar, Manan Gupta, 2025) ≈ 83%
- Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation (Zekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, Xiachong Feng, Qiming Li, 2025) ≈ 83%
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations (in corpus, 2023) ≈ 82%
- The World Inside Neural Networks (in corpus, 2026) ≈ 81%
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets (in corpus, 2023) ≈ 81%
- Anima Labs Phenomenology Pt1 (in corpus) ≈ 80%
- Why Learning Requires Feeling (in corpus, 2026) ≈ 80%
- Cognitive glues are shared models of relative scarcities: the economics of collective intelligence (in corpus, 2026) ≈ 80%
- World models (cited, 2018) ≈ 49%
Cited by (1)
- Model Alignment Search
Model Alignment Search (MAS) establishes bidirectional causal similarity between neural networks by learning a per-model orthogonal rotation matrix that isolates behaviorally relevant subspaces and us