finding

active

finding:among-78-vision-models-on-places-365-models-that-solve-more-vtab-tasks-tend-to-be-more-aligned-with-each-other-with-high-performance-models-forming-a-tightly-clustered-set

Among 78 vision models on Places-365, models that solve more VTAB tasks tend to be more aligned with each other, with high-performance models forming a tightly clustered set

Empirical result showing alignment increases with model competence

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Claims (2)

claim

There is a growing similarity in how datapoints are represented in different neural network models, spanning different architectures, training objectives, and data modalities
supports
Primary empirical claim of the paper
Models that are competent all represent data in a similar way; all strong models are alike, each weak model is weak in its own way
supports
Authors' interpretation of the VTAB alignment results, echoing Tolstoy

Questions (1)

question

What has led to representational convergence, will it continue, and ultimately where does it end?
answered_by
Central motivating questions of the paper

Quotes (1)

quote

Neural networks, trained with different objectives on different data and modalities, are converging to a shared statistical model of reality in their representation spaces.
associated_with
The paper's central thesis statement, presented prominently after the abstract

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Among 78 vision models, those solving more VTAB tasks (higher transfer performance) show higher mutual nearest-neighbor alignment with each other · finding · 0.893
Key empirical finding establishing that representational alignment correlates with model competence
A vision model trained on ImageNet can be aligned with a model trained on Places-365 while maintaining good performance, and early layers are more interchangeable than later layers · finding · 0.805
Lenc & Vedaldi result illustrating data independence in representations and layer-wise alignment
Alignment with vision models corresponds to improved performance on downstream language tasks including commonsense reasoning and math · claim · 0.780
Claims that alignment score is a proxy for general capability
Diverse computer vision models trained on visual recognition tasks converge to remarkably similar internal feature representations regardless of architecture, training procedure, or implementation details, closely matching the organization of animal visual cortex · finding · 0.779
Empirical evidence for the universality hypothesis cited as supporting the possibility of convergent consciousness-like solutions
Olah et al. (2020) found that automatically trained computer vision models, regardless of architecture and training procedure, all arrive at similar functional structures organizing similar features into similar compositional hierarchies, closely resembling the primate visual cortex · finding · 0.774
Empirical finding supporting the Universality Hypothesis; extended by the paper to consciousness
CLIP models exhibit higher language-vision alignment than supervised or self-supervised vision models, but this alignment decreases after fine-tuning on ImageNet classification · finding · 0.771
CLIP training paradigm finding in cross-modal alignment
Performance-optimized hierarchical models predict neural responses in higher visual cortex (Yamins et al., 2014) · concept · 0.768
Demonstrated CNN representations predict neurons in visual cortex; background motivation for neural-network-brain correspondence.
Better LLMs (measured by 1 − bits-per-byte on OpenWebText) show a linear relationship with alignment to vision models, measured via mutual nearest-neighbor on WIT · finding · 0.763
Key cross-modal alignment result
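
Several entries above refer to mutual nearest-neighbor alignment between models. The following is a minimal sketch of one way such a score could be computed, not the paper's implementation. It assumes that both models embed the same ordered list of inputs, that neighbors are ranked by cosine similarity, and that the score is the average fraction of shared k-nearest neighbors. The function names `knn_indices` and `mutual_knn_alignment` are hypothetical.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def knn_indices(features, k):
    """For each sample, return the set of indices of its k nearest neighbors.

    Assumption: nearness is cosine similarity; a sample is never its own neighbor.
    Ties are broken by the lower index so the result is deterministic.
    """
    unit = [_normalize(v) for v in features]
    neighbors = []
    for i, a in enumerate(unit):
        sims = [
            (sum(x * y for x, y in zip(a, b)), j)
            for j, b in enumerate(unit)
            if j != i
        ]
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors.append({j for _, j in sims[:k]})
    return neighbors


def mutual_knn_alignment(feats_a, feats_b, k):
    """Average overlap of k-nearest-neighbor sets across two feature spaces.

    feats_a[i] and feats_b[i] must be the two models' embeddings of the same input.
    Returns a value in [0, 1]; 1 means identical neighbor sets for every sample.
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both models must embed the same samples")
    if not 0 < k < len(feats_a):
        raise ValueError("k must satisfy 0 < k < number of samples")
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)


# Toy usage: model_b is a 90-degree rotation of model_a, so neighbor sets match.
model_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
model_b = [[-y, x] for x, y in model_a]
score = mutual_knn_alignment(model_a, model_b, k=1)
```

Because the score depends only on neighbor rankings, it is unchanged by rotations or rescalings of either feature space, which is one reason a neighbor-based measure is convenient for comparing models with different embedding dimensions.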