finding

active

finding:a-vision-model-trained-on-imagenet-can-be-aligned-with-a-model-trained-on-places-365-while-maintaining-good-performance-and-early-layers-are-more-interchangeable-than-later-layers

A vision model trained on ImageNet can be aligned with a model trained on Places-365 while maintaining good performance, and early layers are more interchangeable than later layers

Lenc & Vedaldi result illustrating data independence in representations and layer-wise alignment

Source paper

extracted_from

The Platonic Representation Hypothesis

(2024) · Minyoung Huh · Brian Cheung · Tongzhou Wang · Phillip Isola

Neighborhood — ranked by edge-count

Hypotheses (1)

hypothesis

Different neural network models trained on different objectives and modalities are converging to a shared statistical model of reality in their representation spaces
supports
The central hypothesis of the paper; the platonic representation hypothesis itself

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

CLIP models exhibit higher language-vision alignment than supervised or self-supervised vision models, but this alignment decreases after fine-tuning on ImageNet classificationfinding0.811
CLIP training paradigm finding in cross-modal alignment
Olah et al. (2020) found that automatically trained computer vision models, regardless of architecture and training procedure, all arrive at similar functional structures organizing similar features into similar compositional hierarchies, closely resembling the primate visual cortex.finding0.808
Empirical finding supporting the Universality Hypothesis; extended by the paper to consciousness
Alignment with vision models corresponds to improved performance on downstream language tasks including commonsense reasoning and mathclaim0.806
Claims that alignment score is a proxy for general capability
Among 78 vision models on Places-365, models that solve more VTAB tasks tend to be more aligned with each other, with high-performance models forming a tightly clustered setfinding0.805
Empirical result showing alignment increases with model competence
Diverse computer vision models trained on visual recognition tasks converge to remarkably similar internal feature representations regardless of architecture, training procedure, or implementation details, closely matching the organization of animal visual cortexfinding0.793
Empirical evidence for the universality hypothesis cited as supporting the possibility of convergent consciousness-like solutions
Among 78 vision models, those solving more VTAB tasks (higher transfer performance) show higher mutual nearest-neighbor alignment with each otherfinding0.781
Key empirical finding establishing that representational alignment correlates with model competence
Jointly training a language model with a vision model improves performance on language tasks compared to training the language model alonefinding0.781
OpenAI GPT-4V finding supporting cross-modal training benefit
Training on image data should improve LLM performance, and training on language data should improve vision model performancehypothesis0.775
Implication of PRH for cross-modal training efficiency