The Platonic Representation Hypothesis

By Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola (MIT)

DOI 10.48550/arxiv.2405.07987 arXiv 2405.07987 OpenAlex W4396914181

Gemini · Idealized World Model (discrete events) · Centered Kernel Alignment · GPT-4V · Centered Kernel Nearest-Neighbor Alignment · Hardware Lottery · Cycle k-NN · LLaVA · Edit Distance k-NN · Platonic Representation · Longest Common Subsequence k-NN · Representational Convergence · Model Stitching · Mutual k-Nearest Neighbor Alignment Metric

TL;DR

Neural networks trained on different data modalities, architectures, and objectives are converging toward a shared statistical model of reality — what the paper terms the "platonic representation" — formalized as the pointwise mutual information (PMI) kernel over co-occurring events in the world. Measured via a mutual nearest-neighbor alignment metric across 78 vision models evaluated on the 19-task VTAB benchmark, models that solve more downstream tasks cluster tightly together while weaker models scatter; the top-performing bin is markedly more internally aligned than the lowest. Cross-modal convergence is equally pronounced: across families spanning BLOOM (560M–7.1B parameters), OpenLLaMA (3B–13B), and LLaMA (7B–65B) paired against DINOv2, MAE, CLIP, and ImageNet-21K ViTs measured on the Wikipedia Image-Text (WIT) dataset, language modeling performance (1−bits-per-byte on OpenWebText) predicts vision-language alignment with a near-linear relationship, and LLM alignment with DINOv2 predicts Hellaswag commonsense accuracy linearly and GSM8K math accuracy in an emergence-like step. Three selective pressures drive convergence: the Multitask Scaling Hypothesis (more tasks shrink the feasible solution set), the Capacity Hypothesis (larger models more reliably reach shared optima), and the Simplicity Bias Hypothesis (deep networks implicitly favor low-complexity solutions). The paper argues this implies that modality-agnostic representations are not an artifact of shared training recipes but an attractor determined by the statistical structure of reality itself, with downstream consequences including cross-modal data interchangeability, reduced hallucination at scale, and the practical ease of linear stitching between modalities.

What to take away

1. Among 78 vision models evaluated on 19 VTAB tasks, representational alignment (mutual k-NN, k=10, measured on Places-365) increases monotonically with average transfer performance, such that the highest-performing quintile forms a tight cluster while the lowest-performing quintile shows maximally variable representations.
2. Across BLOOM (560M–7.1B), OpenLLaMA (3B–13B), and LLaMA (7B–65B) language model families, language modeling score (1−bits-per-byte on OpenWebText) predicts mutual nearest-neighbor alignment to DINOv2 vision features with a near-linear relationship when measured on 1,024 WIT image-caption pairs.
3. LLM alignment score to DINOv2 also predicts downstream task accuracy on Hellaswag (commonsense reasoning) with a linear trend, and on GSM8K (5-shot math) with an emergence-like step function, linking representational geometry to behavioral capability.
4. The paper introduces the mutual k-nearest-neighbor alignment metric (m_NN), measuring the mean intersection of k-NN sets induced by two representation kernels normalized by k, and shows it is more sensitive to cross-modal alignment trends than CKA, which exhibits weak trends even within the same modality.
5. A formal proof shows that the binary NCE and InfoNCE objectives used by contrastive learners are minimized by representations whose dot-product kernel equals the pointwise mutual information (PMI) kernel K_PMI over co-occurring observation pairs, grounding the platonic representation in a concrete mathematical object.
6. Color representations learned from pixel co-occurrence statistics in CIFAR-10 images and from SimCSE/RoBERTa sentence embeddings of color words both recover approximately the same perceptual organization as CIELAB color space via multidimensional scaling, providing a direct empirical test of cross-modal PMI convergence.
7. Using LLaMA3-8B-Instruct to generate summaries of Densely-Captioned-Images (DCI) captions at 5, 10, 20, and 30 words, alignment between language and vision models increases monotonically with caption density, consistent with the prediction that higher mutual information between modalities produces stronger representational convergence.
8. CLIP models fine-tuned on ImageNet-12K classification (CLIP I12K ft) show lower cross-modal alignment with language models than their pre-fine-tuning counterparts, demonstrating that task-narrowing after pretraining actively reduces representational generality.
9. An open question the paper explicitly leaves unresolved is whether the achieved cross-modal mutual nearest-neighbor alignment score of approximately 0.16 (out of a theoretical maximum of 1.0) represents near-complete alignment with residual noise or genuinely poor alignment with substantial structure yet to be explained.
10. The Multitask Scaling Hypothesis holds that fewer representations are competent for N tasks than for M < N tasks, implying that training on maximally diverse objectives, and not scale alone, is a necessary condition for convergence to the platonic representation.
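
The metric in takeaway 4 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's code: the function names, the use of cosine similarity to rank neighbors, and the random toy features are all hypothetical. Only the definition (mean size of the intersection of the two k-NN sets, divided by k) and the value k=10 come from the summary above. Row i of both feature matrices must describe the same sample, for example an image and its caption.

```python
import numpy as np


def knn_indices(feats: np.ndarray, k: int) -> np.ndarray:
    """Return, for each row, the indices of its k nearest neighbors.

    Assumption: neighbors are ranked by cosine similarity and a sample
    is never counted as its own neighbor. Rows must be non-zero.
    """
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches
    return np.argsort(-sim, axis=1)[:, :k]


def mutual_knn_alignment(feats_a: np.ndarray, feats_b: np.ndarray, k: int = 10) -> float:
    """Mean overlap of the two k-NN sets per sample, normalized by k."""
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps)) / k


# Toy usage with random features standing in for two models' embeddings.
rng = np.random.default_rng(0)
vision_feats = rng.normal(size=(60, 16))    # hypothetical vision embeddings
language_feats = rng.normal(size=(60, 8))   # hypothetical language embeddings
self_score = mutual_knn_alignment(vision_feats, vision_feats)    # 1.0 by construction
cross_score = mutual_knn_alignment(vision_feats, language_feats)  # low for unrelated features
```

Because the score depends only on neighbor sets, the two representations can have different dimensionality, which is what makes the metric usable across modalities.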

Peer brief — for seminar discussion

The paper proposes and empirically supports the Platonic Representation Hypothesis: that neural networks trained on different data, objectives, and modalities are converging toward a single shared representation of reality, formalized as the pointwise mutual information (PMI) kernel over co-occurring world events. To measure convergence, the paper introduces the mutual k-nearest-neighbor alignment metric (m_NN), which computes the mean intersection of k-NN sets from two representation kernels; this was chosen over CKA because CKA showed weak or noisy trends even in within-modality comparisons. An alternative the authors could have used is model stitching performance, which Bansal et al. (2021) demonstrated captures aspects of alignment invisible to CKA, though it is computationally heavier and harder to extend to cross-modal settings. The load-bearing empirical finding is twofold. First, among 78 vision models measured on Places-365, alignment increases monotonically with VTAB transfer performance across 19 tasks, with high-performing models clustering tightly and low-performing models scattering — consistent with the Anna Karenina framing borrowed from Bansal et al. Second, across BLOOM (560M–7.1B), OpenLLaMA (3B–13B), and LLaMA (7B–65B) paired against DINOv2, MAE, CLIP, and ImageNet-21K ViTs on 1,024 WIT image-caption samples, language modeling performance (1−bits-per-byte on OpenWebText) predicts language-vision alignment with a near-linear relationship — and that alignment in turn predicts Hellaswag accuracy linearly and GSM8K accuracy in an emergence-like step. Theoretical support comes from proving that binary NCE and InfoNCE contrastive objectives are minimized by representations whose kernel equals K_PMI, and that under sufficient world smoothness K_PMI is exactly expressible as inner products of learned features. Three hypotheses are named as selective pressures: the Multitask Scaling Hypothesis, the Capacity Hypothesis, and the Simplicity Bias Hypothesis. 
The implications are substantial: if modalities share a PMI attractor, then image and language data should be interchangeable for improving either modality's model; cross-modal linear stitching should be cheap; and hallucination and bias should decrease with scale as representations better reflect reality's statistics. The paper also predicts that fine-tuning toward narrow tasks — demonstrated by CLIP fine-tuned on ImageNet-12K showing reduced language alignment — actively degrades platonic convergence. The most contestable element is the alignment metric's interpretability: a mutual nearest-neighbor score of 0.16 against a theoretical maximum of 1.0 is reported for the cross-modal case, and the paper explicitly leaves open whether this reflects near-complete alignment corrupted by noise or genuinely weak alignment with most structure unexplained. A critical reader would also push back on the causal claim that convergence is driven toward a representation of reality rather than toward a representation of the particular internet-scale data distribution used to train these models — the two are conflated by the bijection assumption in the mathematical framework, which the paper itself acknowledges breaks down for lossy or stochastic observations. The restriction to vision and language, with robotics and other embodied modalities noted as lagging, further limits the generality of the current evidence.

Methods (8)

Centered Kernel Alignment
Standard alignment metric cited and compared against; measures global kernel similarity between representations
Centered Kernel Nearest-Neighbor Alignment
Modified CKA metric that restricts cross-covariance to nearest neighbors; introduced in this paper's appendix
Cycle k-NN
Alternative alignment metric; measures whether nearest neighbor in one domain also considers original sample as nearest neighbor in other domain
Edit Distance k-NN
Alternative alignment metric compared in appendix; computes edit distance between nearest neighbor lists
Longest Common Subsequence k-NN
Alternative alignment metric compared in appendix; calculates longest common subsequence of nearest neighbor lists
Model Stitching
Technique to measure representational compatibility by integrating intermediate representations of one model into another
Mutual k-Nearest Neighbor Alignment Metric
Primary alignment metric used in experiments; measures mean intersection of k-nearest neighbor sets between two kernels
Singular Vector Canonical Correlation Analysis
Alternative alignment metric compared in appendix experiments
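
For contrast with the neighbor-based metrics listed above, here is a minimal sketch of CKA. It assumes a linear kernel, chosen for brevity; this page does not say which kernel the paper's CKA comparison uses, and `linear_cka` is a hypothetical name.

```python
import numpy as np


def linear_cka(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Linear-kernel CKA between two row-aligned feature matrices.

    Computes ||A^T B||_F^2 / (||A^T A||_F * ||B^T B||_F) on
    column-centered features (assumption: linear kernel).
    """
    a = feats_a - feats_a.mean(axis=0, keepdims=True)
    b = feats_b - feats_b.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(a.T @ b, ord="fro") ** 2
    norm_a = np.linalg.norm(a.T @ a, ord="fro")
    norm_b = np.linalg.norm(b.T @ b, ord="fro")
    return float(cross / (norm_a * norm_b))


# Toy usage: a rotated and rescaled copy of the same features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(40, 6))
rotation, _ = np.linalg.qr(rng.normal(size=(6, 6)))  # random orthogonal matrix
score = linear_cka(feats, 3.0 * feats @ rotation)     # close to 1.0
```

Linear CKA compares the full similarity structure at once, whereas mutual k-NN looks only at each sample's local neighborhood. That difference in scope is one plausible reading of why the two metrics can show trends of different strength.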

Frameworks (1)

Idealized World Model (discrete events)
Mathematical formalization of a world with T discrete events and bijective observation functions, used to prove PMI convergence
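
To make the PMI kernel concrete, the sketch below estimates it from a table of co-occurrence counts using the standard definition PMI(a, b) = log[P(a, b) / (P(a) P(b))]. The count-matrix input, the `eps` smoothing term, and the function name are assumptions added for illustration; the paper's own estimation procedure (for example in the color experiment) may differ.

```python
import numpy as np


def pmi_kernel(cooc_counts, eps: float = 1e-12) -> np.ndarray:
    """Pointwise mutual information for every pair of discrete events.

    cooc_counts[i, j] is how often event i co-occurred with event j.
    eps (an assumption) avoids log(0) for pairs that never co-occur.
    """
    counts = np.asarray(cooc_counts, dtype=float)
    joint = counts / counts.sum()               # P(a, b)
    marg_a = joint.sum(axis=1, keepdims=True)   # P(a), column vector
    marg_b = joint.sum(axis=0, keepdims=True)   # P(b), row vector
    return np.log((joint + eps) / (marg_a @ marg_b + eps))


# Toy usage: two events that only ever co-occur with themselves.
kernel = pmi_kernel(np.eye(2) * 5)
# Diagonal entries are log(2); off-diagonal entries are strongly negative.
```

Events that co-occur more often than chance get a positive entry, independent events get an entry near zero, and events that avoid each other get a negative one.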

Findings (25)

PMI computed from color cooccurrences in CIFAR-10 images yields a perceptual color representation closely matching both CIELAB space and language model embeddings (SimCSE, RoBERTa)
Validates theoretical PMI convergence claim on real data
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa — linear relationship between language modeling score and vision-language alignment
Core cross-modal empirical result: larger and better language models align better with vision models
A vision model trained on ImageNet can be aligned with a model trained on Places-365 while maintaining good performance, and early layers are more interchangeable than later layers
Lenc & Vedaldi result illustrating data independence in representations and layer-wise alignment
LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM
Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure
Spearman's rank correlation among different alignment metrics (CKA, SVCCA, Mutual k-NN, CKNNA) over 78 vision models is high across variants, with all p-values below 2.24×10^-105
Validates robustness of alignment metric choice
CLIP models exhibit higher language-vision alignment than supervised or self-supervised vision models, but this alignment decreases after fine-tuning on ImageNet classification
CLIP training paradigm finding in cross-modal alignment
Among 78 vision models on Places-365, models that solve more VTAB tasks tend to be more aligned with each other, with high-performance models forming a tightly clustered set
Empirical result showing alignment increases with model competence
Rosetta Neurons — individual neurons activated by the same patterns across a range of diverse vision models form a common dictionary independently discovered by all models
Cited evidence that convergence extends to the neuron level, not just representational geometry
Color distances learned from language cooccurrence statistics closely mirror those learned from image cooccurrence statistics and human perceptual distances (CIELAB)
Case study confirming that PMI-based learning in different modalities recovers the same perceptual representation
Cross-modal language-vision alignment reaches a maximum of approximately 0.16 on mutual nearest-neighbor metric in Figure 3, well below the theoretical maximum of 1
Quantitative bound on observed alignment; raises the open question of whether this gap reflects noise or real misalignment

Claims (20)

Different models cannot converge to the same representation if they have access to fundamentally different information; convergence is capped by mutual information between input signals
Key limitation of the PRH for non-bijective observations
There is a growing similarity in how datapoints are represented in different neural network models, spanning different architectures, training objectives, and data modalities
Primary empirical claim of the paper
Oriented Gabor-like filters are common in both artificial and biological vision systems, suggesting convergence to a similar initial representational layer
Early evidence of cross-system representational convergence
The more descriptive (higher information) a caption is, the better its LLM representation aligns with the visual representation of the corresponding image
Preliminary test of the information-level limitation of PRH; denser captions = higher cross-modal alignment
Scale is sufficient but not necessarily efficient to reach high levels of intelligence; different methods can scale with different efficiency levels
Implication of PRH for 'scale is all you need' argument
If there is a modality-agnostic platonic representation, training on both image and language data should improve the best model in either modality
Implication of PRH for training practice: both modalities point at the same underlying reality
Larger models should amplify bias less than smaller models, with model biases more accurately reflecting data biases rather than exacerbating them
Implication of PRH for AI fairness and bias
Special-purpose intelligences optimized for narrow tasks may not converge; the PRH only holds for intelligences performing well on many tasks
Key limitation of PRH
Zero-shot model stitching without a learned stitching layer is feasible because different text models embed data in remarkably similar ways
Strong evidence for representational alignment across models
Conditional generation is easier than unconditional because the conditioning data shares the same platonic structure as the generated data
Implication of PRH for generative models

Hypotheses (13)

Language models would achieve some notion of grounding in the visual domain even in the absence of cross-modal training data, because they share a common modality-agnostic representation
Implication of PRH for language model visual grounding
Different neural network models trained on different objectives and modalities are converging to a shared statistical model of reality in their representation spaces
The central hypothesis of the paper; the platonic representation hypothesis itself
Deep networks are biased toward finding simple fits to the data, and the bigger the model the stronger the bias, driving convergence to a smaller solution space
Selective pressure toward convergence via implicit regularization
A family of contrastive learners converges to a representation whose kernel is the pointwise mutual information (PMI) of the underlying events
Mathematical formalization of what representation models converge to
There are fewer representations competent for N tasks than M<N tasks, so training more general models should yield fewer possible solutions
Selective pressure toward convergence via task generality
Scaling model size, as well as data and task diversity, drives representational convergence toward the platonic representation
Core mechanism hypothesis connecting PRH to the empirical trend of scaling in AI
Training on image data should improve LLM performance, and training on language data should improve vision model performance
Implication of PRH for cross-modal training efficiency
As models scale and converge toward an accurate model of reality, hallucinations should decrease with scale
Implication of PRH for LLM hallucination
Higher information (denser) captions should yield higher language-vision alignment scores
Tests the information-level cap on cross-modal alignment
Bigger models are more likely to converge to a shared representation than smaller models
Selective pressure toward convergence via model capacity

Questions (9)

Is a mutual nearest-neighbor alignment score of 0.16 indicative of strong alignment with remaining gap being noise, or does it signify poor alignment with major differences left to explain?
Open question the authors leave unresolved about interpreting the magnitude of their alignment measurements
Can language really describe the ineffable experience of watching a total solar eclipse, or how could an image convey a concept like 'I believe in the freedom of speech'?
Counterexample question about modality-specific information limits
What is the appropriate metric for measuring representational alignment, given active debate on merits and deficiencies of all proposed measures?
Open methodological question acknowledged as limitation
Research gap: representational convergence in robotics has not been demonstrated at the same level as vision and language
Authors note robotics lacks a standardized representation approach and sufficient training data diversity to show PRH effects
Research gap: active debate on the merits and deficiencies of all current ways of measuring representational alignment
Authors acknowledge there is no settled best alignment metric, affecting the interpretation of all convergence findings
Research gap: developing PRH for non-bijective, lossy, or stochastic observation functions and abstract concepts
The authors identify that their formal convergence proof requires bijective modality mappings, leaving a gap for more realistic settings
What has led to representational convergence, will it continue, and ultimately where does it end?
Central motivating questions of the paper
What has led to this convergence? Will it continue? And ultimately, where does it end?
Core research questions motivating the paper
What exactly is the endpoint of representational convergence?
Motivates Section 4 where the PMI-kernel formalization is proposed

Original abstract

We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato's concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion
Eghbal A. Hosseini, Brian Cheung, Evelina Fedorenko, Alex H. Williams
2026
≈ 88%
The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?
Zhaoyang Zhang, Run Shao, Dongyue Wu, Jiajie Teng, Chao Tao, Jingdong Chen, Haifeng Li
2026
≈ 87%
Semi-supervised Multimodal Representation Learning through a Global Workspace
Benjamin Devillers, Léopold Maytié, and Rufin VanRullen
2025
≈ 86%
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
in corpus
2025
≈ 85%
Beyond Object-Level Alignment: Do Brains and DNNs Preserve the Same Transformations?
Yukiyasu Kamitani
2026
≈ 84%
Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings
Grégoire Dhimoïla, Thomas Fel, Victor Boutin, Agustin Picard
2026
≈ 84%
Model Alignment Search
in corpus
2025
≈ 84%
Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions
Usman Naseem
2026
≈ 84%
The Indra Representation Hypothesis for Multimodal Alignment
Jianglin Lu, Hailing Wang, Kuo Yang, Yitian Zhang, Simon Jenni, and Yun Fu
2026
≈ 83%
Decipher the Modality Gap in Multimodal Contrastive Learning: From Convergent Representations to Pairwise Alignment
Lingjie Yi, Raphael Douady, Chao Chen
2025
≈ 83%
Interpreting Neural Networks through the Polytope Lens
Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, Connor Leahy
2022
≈ 83%
The Umwelt Representation Hypothesis: Rethinking Universality
Victoria Bosch, Rowan Sommers, Adrien Doerig, Tim C Kietzmann
2026
≈ 83%
Visual Representations inside the Language Model
Benlin Liu, Amita Kamath, Madeleine Grunde-McLaughlin, Winson Han, Ranjay Krishna
2025
≈ 83%
Disentangling Polysemantic Neurons with a Null-Calibrated Polysemanticity Index and Causal Patch Interventions
Manan Gupta, Dhruv Kumar
2025
≈ 83%
Similarity of Processing Steps in Vision Model Representations
Matéo Mahaut, Marco Baroni
2026
≈ 83%
Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation
Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, Xiachong Feng
2025
≈ 83%
Learning Shared Dynamics with Meta-World Models
Lisheng Wu, Minne Li, Jun Wang
2018
≈ 83%
Multimodal Chain-of-Thought Reasoning in Language Models
in corpus
2023
≈ 82%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 82%
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
in corpus
2023
≈ 82%
The World Inside Neural Networks
in corpus
2026
≈ 81%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
in corpus
2023
≈ 81%
Can "consciousness" be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis
in corpus
2025
≈ 81%
Anima Labs Phenomenology Pt1
in corpus
≈ 80%
Why Learning Requires Feeling
in corpus
2026
≈ 80%
Cognitive glues are shared models of relative scarcities: the economics of collective intelligence
in corpus
2026
≈ 80%
Input–output maps are strongly biased towards simple outputs
cited
2018
≈ 75%
Possible Principles Underlying the Transformations of Sensory Messages
cited
2012
≈ 71%
Training verifiers to solve math word problems
cited
2021
≈ 58%
World models
cited
2018
≈ 49%

Cited by (1)

Model Alignment Search
Model Alignment Search (MAS) establishes bidirectional causal similarity between neural networks by learning a per-model orthogonal rotation matrix that isolates behaviorally relevant subspaces and us