# Towards the Platonic Representation via Multimodal Contrastive Alignment

## Project Overview
This project, completed as part of MIT’s Deep Learning course (6.7960), explores whether representations from disparate pre-trained unimodal neural networks can be aligned into a shared multimodal latent space. Inspired by CLIP and the Platonic Representation Hypothesis, we use lightweight adapters to align frozen encoders without expensive joint retraining.
## Key Contributions
- Novel Framework: Align pre-trained unimodal encoders (ResNet-18, DistilBERT) using simple linear adapters
- Empirical Validation: Evidence consistent with the Platonic Representation Hypothesis
- Strong Results: The aligned multimodal representations agree more closely with DINOv2 embeddings than the original unimodal representations do
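The alignment idea above can be sketched concretely: each frozen encoder's output is projected through a trainable linear adapter into a shared space, and the adapters are trained with a CLIP-style symmetric contrastive (InfoNCE) loss over matched image–text pairs. The sketch below (NumPy, with hypothetical names like `W_img`/`W_txt` and an illustrative temperature) shows the loss computation only; the actual adapter dimensions, temperature, and training loop used in the project are assumptions, not its exact configuration.

```python
import numpy as np

def clip_adapter_loss(img_emb, txt_emb, W_img, W_txt, temperature=0.07):
    """CLIP-style symmetric contrastive loss over linear adapters.

    img_emb, txt_emb: frozen encoder outputs for N matched pairs, shape (N, d).
    W_img, W_txt: trainable linear adapters projecting into the shared space.
    """
    # Project frozen unimodal embeddings into the shared latent space.
    z_i = img_emb @ W_img
    z_t = txt_emb @ W_txt
    # L2-normalize so the dot product is cosine similarity.
    z_i = z_i / np.linalg.norm(z_i, axis=1, keepdims=True)
    z_t = z_t / np.linalg.norm(z_t, axis=1, keepdims=True)
    # Pairwise similarity matrix; matched pairs lie on the diagonal.
    logits = z_i @ z_t.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)

    def cross_entropy(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Symmetric loss: image-to-text and text-to-image retrieval.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Only `W_img` and `W_txt` would receive gradients during training; the encoders producing `img_emb` and `txt_emb` stay frozen, which is what keeps the approach cheap relative to joint retraining.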
## Resources

Completed in collaboration with Gabe Manso.