Towards the Platonic Representation via Multimodal Contrastive Alignment

Project Overview

This project, completed as part of MIT’s Deep Learning course (6.7960), explores whether representations from disparate pre-trained unimodal neural networks can be aligned into a shared multimodal latent space. Inspired by CLIP and the Platonic Representation Hypothesis, we use lightweight adapters to align frozen encoders without expensive joint retraining.
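The README does not spell out the training objective, but a CLIP-style symmetric contrastive (InfoNCE) loss over linear adapters on frozen features is the natural reading. Below is a minimal NumPy sketch under that assumption; the dimensions (512-d ResNet-18 features, 768-d DistilBERT features, a hypothetical 128-d shared space), the temperature value, and all variable names are illustrative, not taken from the project:

```python
import numpy as np

def normalize(x):
    """L2-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def contrastive_alignment_loss(img_feats, txt_feats, W_img, W_txt, temperature=0.07):
    """Symmetric InfoNCE loss: frozen unimodal features are projected
    through linear adapters (W_img, W_txt) into a shared space, and
    matching image-text pairs on the diagonal are pulled together."""
    z_img = normalize(img_feats @ W_img)
    z_txt = normalize(txt_feats @ W_txt)
    logits = (z_img @ z_txt.T) / temperature  # cosine similarities, scaled
    n = logits.shape[0]
    idx = np.arange(n)

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[idx, idx]).mean()    # diagonal = positive pairs

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Illustrative batch: frozen encoder outputs stand in for real features
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(8, 512))   # e.g. ResNet-18 pooled features
txt_feats = rng.normal(size=(8, 768))   # e.g. DistilBERT [CLS] features
W_img = rng.normal(size=(512, 128)) * 0.02  # trainable linear adapters
W_txt = rng.normal(size=(768, 128)) * 0.02
loss = contrastive_alignment_loss(img_feats, txt_feats, W_img, W_txt)
```

In a real run only `W_img` and `W_txt` would receive gradients, which is what keeps this far cheaper than joint retraining of both encoders.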

Key Contributions

  • Novel Framework: aligns frozen pre-trained unimodal encoders (ResNet-18 for vision, DistilBERT for text) using simple linear adapters
  • Theoretical Validation: empirical evidence supporting the Platonic Representation Hypothesis
  • Strong Results: the aligned multimodal representations match reference DINOv2 embeddings more closely than the original unimodal representations do
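The README does not state which similarity metric is used to compare representations against DINOv2 embeddings. Linear CKA (centered kernel alignment) is one standard choice for this kind of cross-model comparison; the sketch below is an assumed illustration, not the project's confirmed metric:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two feature matrices of shape (n_samples, dim).
    Returns a value in [0, 1]; 1 means the representations are identical
    up to rotation and isotropic scaling."""
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den

# Sanity check: a representation is maximally similar to a rotated copy of itself
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))  # random orthogonal rotation
self_sim = linear_cka(X, X @ Q)
```

Under this kind of metric, "multimodal outperforms unimodal" means the adapter-aligned shared-space features score higher against DINOv2 than either frozen encoder's raw features do.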

Resources

This project was developed in collaboration with Gabe Manso.