finding

active

finding:among-78-vision-models-those-solving-more-vtab-tasks-higher-transfer-performance-show-higher-mutual-nearest-neighbor-alignment-with-each-other

Among 78 vision models, those solving more VTAB tasks (higher transfer performance) show higher mutual nearest-neighbor alignment with each other

Key empirical finding establishing that representational alignment correlates with model competence
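
A minimal sketch of how a mutual nearest-neighbor alignment score between two models might be computed, offered as an illustration rather than the paper's implementation. It assumes both models have embedded the same inputs in the same order, uses cosine similarity to find neighbors, and defaults to k=10; the function names `knn_indices` and `mutual_knn_alignment`, the similarity measure, and the value of k are all assumptions made here. Rows with zero norm are not handled.

```python
import numpy as np


def knn_indices(feats, k):
    """Return, for each row, the indices of its k nearest neighbors (cosine similarity).

    Assumption: cosine similarity is the neighbor criterion; a sample is never
    counted as its own neighbor.
    """
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # exclude each sample from its own neighbor set
    return np.argsort(-sim, axis=1)[:, :k]


def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Mean fraction of k-nearest neighbors shared by two representations.

    feats_a and feats_b are (n_samples, dim) arrays from two models evaluated
    on the same inputs; the feature dimensions may differ.
    """
    if feats_a.shape[0] != feats_b.shape[0]:
        raise ValueError("both models must embed the same number of samples")
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    # Count how many neighbors each sample has in common across the two spaces.
    overlap = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap)) / k
```

Under these assumptions the score lies between 0 and 1: two models that induce identical neighborhoods score 1, while unrelated representations score close to k divided by the number of other samples. Computing it for every pair of models yields the pairwise alignment values that the finding compares against the number of VTAB tasks solved.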

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Claims (1)

claim

Models that are competent all represent data in a similar way; all strong models are alike, each weak model is weak in its own way
supports
Authors' interpretation of the VTAB alignment results, echoing Tolstoy

Hypotheses (3)

hypothesis

Different neural network models trained on different objectives and modalities are converging to a shared statistical model of reality in their representation spaces
associated_with · supports
The central hypothesis of the paper; the platonic representation hypothesis itself
Multitask Scaling Hypothesis
supports
Argues that fewer representations are competent for N tasks than for M < N tasks, so more general models have a smaller solution space
Capacity Hypothesis
supports
Bigger models are more likely to converge to a shared representation than smaller models because they can better approximate the global optimum

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Among 78 vision models on Places-365, models that solve more VTAB tasks tend to be more aligned with each other, with high-performance models forming a tightly clustered set · finding · 0.893
Empirical result showing alignment increases with model competence
Better LLMs (measured by 1 minus bits-per-byte on OpenWebText) show a linear relationship with alignment to vision models, measured via mutual nearest-neighbor on WIT · finding · 0.803
Key cross-modal alignment result
Alignment with vision models corresponds to improved performance on downstream language tasks including commonsense reasoning and math · claim · 0.802
Claims that alignment score is a proxy for general capability
Diverse computer vision models trained on visual recognition tasks converge to remarkably similar internal feature representations regardless of architecture, training procedure, or implementation details, closely matching the organization of animal visual cortex · finding · 0.793
Empirical evidence for the universality hypothesis cited as supporting the possibility of convergent consciousness-like solutions
Olah et al. (2020) found that automatically trained computer vision models, regardless of architecture and training procedure, all arrive at similar functional structures organizing similar features into similar compositional hierarchies, closely resembling the primate visual cortex · finding · 0.784
Empirical finding supporting the Universality Hypothesis; extended by the paper to consciousness
A vision model trained on ImageNet can be aligned with a model trained on Places-365 while maintaining good performance, and early layers are more interchangeable than later layers · finding · 0.781
Lenc & Vedaldi result illustrating data independence in representations and layer-wise alignment
Performance-optimized hierarchical models predict neural responses in higher visual cortex (Yamins et al., 2014) · concept · 0.781
Demonstrated CNN representations predict neurons in visual cortex; background motivation for neural-network-brain correspondence.
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: a linear relationship between language modeling score and vision-language alignment · finding · 0.771
Core cross-modal empirical result: larger and better language models align better with vision models