Towards the Platonic Representation via Multimodal Contrastive Alignment

Project Overview

This project, completed as part of MIT’s Deep Learning course (6.7960), explores whether representations from disparate pre-trained unimodal neural networks can be aligned into a shared multimodal latent space. Inspired by CLIP and the Platonic Representation Hypothesis, we use lightweight adapters to align frozen encoders without expensive joint retraining.
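The README does not spell out the training objective, but a CLIP-style symmetric contrastive (InfoNCE) loss over linear adapters on frozen features is the natural reading. Below is a minimal NumPy sketch under that assumption; the dimensions (512-d ResNet-18 features, 768-d DistilBERT features, a hypothetical 128-d shared space), the temperature value, and all variable names are illustrative, not taken from the project:

```python
import numpy as np

def normalize(x):
    """L2-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def contrastive_alignment_loss(img_feats, txt_feats, W_img, W_txt, temperature=0.07):
    """Symmetric InfoNCE loss: frozen unimodal features are projected
    through linear adapters (W_img, W_txt) into a shared space, and
    matching image-text pairs on the diagonal are pulled together."""
    z_img = normalize(img_feats @ W_img)
    z_txt = normalize(txt_feats @ W_txt)
    logits = (z_img @ z_txt.T) / temperature  # cosine similarities, scaled
    n = logits.shape[0]
    idx = np.arange(n)

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[idx, idx]).mean()    # diagonal = positive pairs

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Illustrative batch: frozen encoder outputs stand in for real features
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(8, 512))   # e.g. ResNet-18 pooled features
txt_feats = rng.normal(size=(8, 768))   # e.g. DistilBERT [CLS] features
W_img = rng.normal(size=(512, 128)) * 0.02  # trainable linear adapters
W_txt = rng.normal(size=(768, 128)) * 0.02
loss = contrastive_alignment_loss(img_feats, txt_feats, W_img, W_txt)
```

In a real run only `W_img` and `W_txt` would receive gradients, which is what keeps this far cheaper than joint retraining of both encoders.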

Key Contributions

  • Novel Framework: aligns frozen pre-trained unimodal encoders (ResNet-18 for vision, DistilBERT for text) using simple linear adapters
  • Theoretical Validation: empirical evidence supporting the Platonic Representation Hypothesis
  • Strong Results: the aligned multimodal representations match reference DINOv2 embeddings more closely than the original unimodal representations do
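The README does not state which similarity metric is used to compare representations against DINOv2 embeddings. Linear CKA (centered kernel alignment) is one standard choice for this kind of cross-model comparison; the sketch below is an assumed illustration, not the project's confirmed metric:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two feature matrices of shape (n_samples, dim).
    Returns a value in [0, 1]; 1 means the representations are identical
    up to rotation and isotropic scaling."""
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den

# Sanity check: a representation is maximally similar to a rotated copy of itself
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))  # random orthogonal rotation
self_sim = linear_cka(X, X @ Q)
```

Under this kind of metric, "multimodal outperforms unimodal" means the adapter-aligned shared-space features score higher against DINOv2 than either frozen encoder's raw features do.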

Resources

This project was developed in collaboration with Gabe Manso.