Generating word alignments for Russian-Chinese parallel corpus with BERT
A parallel corpus is a collection of texts in two or more languages, aligned at the sentence level. It is a useful tool for linguists studying those languages, as well as for machine learning specialists working on NLP. Most modern machine translation systems are trained on such corpora.
The Russian-Chinese parallel corpus is part of the Russian National Corpus. To the best of our knowledge, it is the first parallel corpus available online for this language pair.
Having not just sentence-level but also word-level alignments in a parallel corpus is important for a variety of downstream NLP tasks, such as automatic evaluation of translation output, generation of translation lexicons, and knowledge transfer from high-resource to low-resource languages.
This is an example of what word alignment looks like for a short sentence. An X in a cell means that the corresponding Russian and Chinese words are aligned to each other. In general, this can be a many-to-many relationship.
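The same grid can be held in code as a set of (Russian index, Chinese index) pairs. Below is a minimal Python sketch. The sentence pair, its links, and the helper names `render_alignment_grid` and `to_pharaoh` are illustrative assumptions, not data or code from the corpus. The `i-j` text form is the Pharaoh-style format that many alignment tools read and write.

```python
def render_alignment_grid(src_tokens, tgt_tokens, links):
    """Return one text row per source token, with X marking aligned cells."""
    rows = []
    for i in range(len(src_tokens)):
        cells = ["X" if (i, j) in links else "." for j in range(len(tgt_tokens))]
        rows.append(" ".join(cells))
    return rows


def to_pharaoh(links):
    """Serialize links as space-separated 'i-j' pairs, sorted by source index."""
    return " ".join(f"{i}-{j}" for i, j in sorted(links))


# Hypothetical example sentence pair (not taken from the corpus).
russian = ["Я", "читаю", "книгу"]
chinese = ["我", "在", "读", "书"]

# Illustrative links: the Russian verb is linked to two Chinese tokens,
# so a single row of the grid contains more than one X.
links = {(0, 0), (1, 1), (1, 2), (2, 3)}

grid = render_alignment_grid(russian, chinese, links)
for token, row in zip(russian, grid):
    print(f"{row}  {token}")
print(to_pharaoh(links))
```

Storing links as a set of pairs places no limit on how many partners a word has, so it covers the many-to-many case directly.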
Example of alignment for the word “carriage” (карета / 车 or 马车) in the UI:
For this project, I am collaborating with a team of linguists, machine learning engineers, and software developers from the Russian Language Institute of the Russian Academy of Sciences and the Higher School of Economics.
Project overview:
- Extracting word alignments from the contextualized word embeddings of BERT and LaBSE, as suggested in “Word Alignment by Fine-tuning Embeddings on Parallel Corpora” by Zi-Yi Dou and Graham Neubig
- Further fine-tuning pretrained BERT and LaBSE on a unique Russian-Chinese parallel corpus of 2.3 million words / 779,632 sentence pairs
- Working with a team of Chinese language experts to produce ~500 gold-standard sentence pairs, manually aligned at the word level, to be used for further fine-tuning in supervised mode and as a test set
- Putting the model to work in production, highlighting specific word translations in different contexts
- Planned launch date: September 2021
- Preliminary results presented at the International Symposium PaCor 2021 at the University of the Basque Country (see paper abstract and slide deck)
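
The first item above, extracting alignments from contextual embeddings, can be sketched roughly as follows. This is a simplified illustration of the similarity-plus-softmax idea from the Dou and Neubig paper, not the project's actual code. Hand-made toy vectors stand in for BERT or LaBSE hidden states, and the function name `extract_alignments` and the threshold value are assumptions made for the example. Token vectors are compared by dot product, the similarity matrix is normalized with a softmax in both directions, and a pair is kept only if both probabilities exceed the threshold.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def extract_alignments(src_vecs, tgt_vecs, threshold=0.001):
    """Return the set of (source index, target index) pairs whose softmax
    probability exceeds the threshold in both directions."""
    n, m = len(src_vecs), len(tgt_vecs)
    # Dot-product similarity between every source and target vector.
    sim = [[sum(a * b for a, b in zip(s, t)) for t in tgt_vecs] for s in src_vecs]
    # Source-to-target: normalize each row over the target tokens.
    src_to_tgt = [softmax(row) for row in sim]
    # Target-to-source: normalize each column over the source tokens.
    tgt_to_src = [softmax([sim[i][j] for i in range(n)]) for j in range(m)]
    return {
        (i, j)
        for i in range(n)
        for j in range(m)
        if src_to_tgt[i][j] > threshold and tgt_to_src[j][i] > threshold
    }


# Toy vectors (assumption): in practice these would be contextual embeddings
# produced by the model for each token of a sentence pair.
source_vectors = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]]
target_vectors = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]]

alignments = extract_alignments(source_vectors, target_vectors)
print(sorted(alignments))  # [(0, 0), (1, 1), (1, 2), (2, 3)]
```

The second source vector matches two target vectors equally well, so it is linked to both. This shows how thresholding in both directions can yield one-to-many links rather than forcing a single best match.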