Independently trained multimodal models like CLIP aren't so independent after all: a single orthogonal transformation can align their embedding spaces across both image and text modalities.
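
To make the claim concrete, here is a minimal sketch of what "aligning two embedding spaces with a single orthogonal transformation" can look like. It uses the classical orthogonal Procrustes solution, which is an assumption on our part: the summary above does not say how the transformation is estimated. The embeddings are synthetic stand-ins rather than real CLIP outputs, and every name (`fit_orthogonal_map`, `image_a`, `text_b`, and so on) is hypothetical.

```python
import numpy as np


def fit_orthogonal_map(source, target):
    """Return the orthogonal Q minimizing ||source @ Q - target||_F.

    Classical orthogonal Procrustes solution: take the SVD of the
    cross-covariance source.T @ target and drop the singular values.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


# Synthetic stand-ins for two models' embeddings (NOT real CLIP outputs).
rng = np.random.default_rng(0)
n_pairs, dim = 200, 16
image_a = rng.standard_normal((n_pairs, dim))  # "model A" image embeddings
text_a = rng.standard_normal((n_pairs, dim))   # "model A" text embeddings

# Assumption for the demo: model B's space is an exact rotation of model A's.
hidden_q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
image_b = image_a @ hidden_q
text_b = text_a @ hidden_q

# Fit the map on image embeddings only, then reuse it for text embeddings.
q = fit_orthogonal_map(image_a, image_b)
text_error = np.abs(text_a @ q - text_b).max()
print("Q is orthogonal:", np.allclose(q.T @ q, np.eye(dim)))
print("max text alignment error:", text_error)
```

In this toy setup the map fitted on image embeddings also aligns the text embeddings, because both modalities were rotated by the same hidden matrix by construction. With real models the fit would be approximate at best, and how well a map learned on one modality transfers to the other is the empirical question the summary is about.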