You can align cross-lingual embeddings for LLMs using translation datasets by learning a mapping matrix between monolingual embedding spaces through techniques like Procrustes alignment.
Here is the code snippet below:
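A minimal sketch of such a pipeline, offered as an illustration under stated assumptions rather than a definitive implementation. The paired embeddings are synthetic stand-ins (random vectors that share a latent structure), since no particular model or dataset is specified; in practice `src` and `tgt` would hold the embeddings of the two sides of each translation pair, one pair per row. The helper names `pca_fit`, `project`, and `orthogonal_procrustes` are hypothetical and written with plain NumPy.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return (mean, components); components has shape (dim, n_components)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T


def project(X, mean, components):
    """Center X and project it onto the given principal directions."""
    return (X - mean) @ components


def orthogonal_procrustes(A, B):
    """Return the orthogonal matrix W that minimizes ||A @ W - B||_F."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt


# ASSUMPTION: synthetic paired embeddings standing in for real model outputs.
# Both "languages" encode the same latent vectors in different bases, plus noise.
rng = np.random.default_rng(0)
n_pairs, dim, k = 500, 64, 16
latent = rng.normal(size=(n_pairs, k))
src_basis, _ = np.linalg.qr(rng.normal(size=(dim, k)))
tgt_basis, _ = np.linalg.qr(rng.normal(size=(dim, k)))
src = latent @ src_basis.T + 0.01 * rng.normal(size=(n_pairs, dim))
tgt = latent @ tgt_basis.T + 0.01 * rng.normal(size=(n_pairs, dim))

# Hold out some translation pairs to check that the mapping generalizes.
n_train = 400
src_train, src_test = src[:n_train], src[n_train:]
tgt_train, tgt_test = tgt[:n_train], tgt[n_train:]

# Step 1: PCA per language, fitted on the training pairs only.
src_mean, src_comp = pca_fit(src_train, k)
tgt_mean, tgt_comp = pca_fit(tgt_train, k)

# Step 2: orthogonal Procrustes on the reduced training embeddings.
W = orthogonal_procrustes(
    project(src_train, src_mean, src_comp),
    project(tgt_train, tgt_mean, tgt_comp),
)

# Step 3: map held-out source embeddings into the target space and compare.
aligned_test = project(src_test, src_mean, src_comp) @ W
target_test = project(tgt_test, tgt_mean, tgt_comp)
rel_err = np.linalg.norm(aligned_test - target_test) / np.linalg.norm(target_test)
print(f"held-out relative alignment error: {rel_err:.4f}")
```

With real embeddings, a retrieval metric such as nearest-neighbor accuracy on held-out translation pairs would usually be a more informative check than the raw reconstruction error used here.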

The code above relies on the following key points:
- Bilingual embeddings generated from a translation-aligned dataset.
- PCA for dimensionality reduction and noise filtering.
- Orthogonal Procrustes algorithm to find a linear alignment matrix.
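
For reference, the orthogonal Procrustes step has a well-known closed-form solution. The notation here is an assumption for illustration: $X$ and $Y$ are the PCA-reduced source and target embedding matrices, with each translation pair occupying the same row in both.

```latex
W^{*} = \arg\min_{W^{\top} W = I} \lVert X W - Y \rVert_{F},
\qquad
X^{\top} Y = U \Sigma V^{\top}
\;\Longrightarrow\;
W^{*} = U V^{\top}
```

Because $W^{*}$ is orthogonal, it preserves distances and angles within the source space, which is part of what makes the mapping easy to interpret.
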
Hence, this method provides an efficient and interpretable way to align multilingual embeddings using paired translation data.