On word alignment of Russian-Chinese parallel corpora
On June 24, 2021, my co-authors and I presented a paper on word alignment at the International Symposium PaCor 2021 at the University of the Basque Country. Below is the paper's abstract.
- Olga BONETSKAYA HSE University, Computer Science Faculty (Moscow, Russia), oabonetskaya@edu.hse.ru
- Dmitry DOLGOV Independent AI researcher (Vancouver, Canada), dmitry.m.dolgov@gmail.com
- Maria FROLOVA Independent researcher, frolovam@bk.ru
- Anastasia POLITOVA School of Liberal Arts, Nanjing University (Nanjing, China), anastasia_politova@mail.ru
- Anna PYRKOVA HSE University, School of Asian Studies (Moscow, Russia), aipyrkova@edu.hse.ru
ABSTRACT
Word alignment of parallel corpora is the task of finding word-to-word correspondences in bitexts that have already been aligned at the sentence level. It is a fundamental task in both natural language processing and linguistics. In the former, it is essential for a range of downstream tasks, such as knowledge transfer from high-resource to low-resource languages (e.g. projecting input formatting and annotations) and evaluation of machine translation quality. In the latter, it is of great help to learners of foreign languages: beginners can search for clear examples of how a single word or grammatical construction is used, while translators can check translations of problematic idioms or collocations and choose the best solution.
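As a toy illustration of the task (our own example, not drawn from the corpus), an alignment can be represented as a set of token-index pairs over a sentence-aligned bitext:

```python
# A toy illustration (our own example, not from RuZhCorp): an alignment is
# a set of (source_index, target_index) pairs over a sentence-aligned bitext.
ru = ["Я", "читаю", "книгу"]   # "I am reading a book"
zh = ["我", "在", "读", "书"]   # "I (progressive marker) read book"

# "在" marks progressive aspect and has no single-word Russian counterpart,
# so it is left unaligned.
alignment = {(0, 0), (1, 2), (2, 3)}

for i, j in sorted(alignment):
    print(ru[i], "<->", zh[j])
```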
The current study aims to develop a word-aligned Russian-Chinese dataset, to evaluate and compare various machine learning algorithms that produce word-to-word alignment automatically, and to build a web tool for identifying word and expression translations in context within a Russian-Chinese parallel corpus, RuZhCorp (Durneva et al., 2020). RuZhCorp was created in 2016 by a team of sinologists and computational linguists as a branch of the Russian National Corpus. Since receiving a stand-alone version on the Higher School of Economics (Moscow) website in 2019, RuZhCorp has accumulated 1,070 texts and 3.5 million words of Russian and Chinese text. Text types include fiction, news, and official documents, but since roughly 80% of the corpus is currently fiction, the present study focuses primarily on the fiction genre.
While building the dataset, we first developed an alignment manifesto, a set of high-level principles stipulating that alignment should be based on how words are represented in the two languages rather than on context. In other words, only tokens with a clear semantic correspondence in both languages should be aligned, while tokens added or omitted out of contextual necessity, for literary purposes, or to avoid tautology should not. The manifesto builds on existing work and approaches in the field of word alignment. In particular, we drew on the manual word alignment of six language pairs (combinations of Portuguese, English, French, and Spanish) described in “Building a golden collection of parallel Multi-Language Word Alignments” (Graça et al., 2008), which yielded one hundred examples per language pair extracted from the Europarl corpus. The authors of that study compiled a detailed manual alignment guideline distinguishing sure from possible alignments, and our team adopted this approach. Our manual alignment procedure also has some distinctive features. We use Google Sheets instead of the annotation software used by Graça et al. (2008). Like our predecessors, we defined a set of specific rules to achieve a better representation of the searched words in the bilingual corpus. Classifiers, which are pervasive in Chinese, were one of the stumbling blocks of the alignment process; we therefore stipulated that only classifiers with a clear match in the corresponding language are to be linked. Similar rules were developed for prepositions, auxiliary verbs, modal particles, etc. In sum, we both proposed a theoretical roadmap for unifying word alignment principles across language pairs and elaborated precise alignment rules for the Russian-Chinese pair.
We have created a framework for manual annotation of parallel corpora with word alignments that captures many-to-many relationships in a way that is easily understood by both humans and machines. The framework distinguishes between sure and possible alignment links and incorporates an iterative peer-review process to guarantee uniform quality of the resulting dataset. It was applied to 125 randomly selected sentence pairs, which were then used to evaluate the automatic alignment algorithms. Following previous research, the first annotator labels alignment points as “1” (a sure alignment point confirming an existing machine alignment), “2” (a possible alignment point confirming an existing machine alignment), “n” and “p” (the analogous sure and possible points for correspondences omitted by the machine alignment), or “q” (a questionable alignment point to be discussed). Where the machine markup links non-corresponding word pairs, the first annotator puts the letter “d”, marking the inappropriate alignment point for deletion. A second annotator then performs a second pass, putting “11” and “22” for sure and possible alignment points respectively and carrying out the “d” deletions when in agreement with the first annotator; otherwise a “q” is put. This two-step process yields a relatively high rate of annotator consistency. A sketch of how these labels can be resolved follows below.
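To make the scheme concrete, the sketch below shows one way the second-pass labels could be resolved into final sure and possible link sets; the function and dictionary layout are our own illustration, not the project's actual tooling.

```python
# A minimal sketch (our own illustration) of resolving the two-pass labels
# described above into final sure and possible alignment links.
def resolve(labels):
    """labels: dict mapping (ru_idx, zh_idx) -> label string after the
    second annotator's pass, e.g. {(0, 0): "11", (1, 2): "22", ...}."""
    sure, possible, disputed = set(), set(), set()
    for link, label in labels.items():
        if label == "11":       # sure link confirmed by both annotators
            sure.add(link)
        elif label == "22":     # possible link confirmed by both annotators
            possible.add(link)
        elif label == "q":      # flagged for discussion in either pass
            disputed.add(link)
        # links whose "d" deletion was carried out are simply dropped
    # by the usual convention, sure links also count as possible (S ⊆ P)
    return sure, possible | sure, disputed
```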
As for automatic alignment, it has historically been done mostly with statistical methods. The expectation-maximization algorithm was first proposed by Dempster et al. (1977) and applied to word alignment in the IBM models of Brown et al. (1993). Och and Ney (2003) created a tool called GIZA++ that remains a common benchmark to this day. Later, several deep learning approaches were proposed: Yang et al. (2013) used a deep neural network (DNN) to discriminatively learn bilingual word embeddings; Bahdanau, Cho and Bengio (2015) used a DNN to jointly learn to align and translate; and Stengel-Eskin et al. (2019) used supervised learning to extract alignments from the attention module of a Transformer.
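For concreteness, here is a minimal sketch of IBM Model 1 trained with EM; the function names, the NULL-token convention, and the greedy decoder are our own illustrative choices, not a specific implementation from the literature.

```python
# A minimal sketch of IBM Model 1 trained with EM (Brown et al., 1993).
# Function names and the greedy decoder are our own illustrative choices.
from collections import defaultdict

def train_ibm1(bitext, iterations=10):
    """bitext: list of (source_tokens, target_tokens) pairs."""
    t = defaultdict(lambda: 1e-3)          # t[(s, w)] ~ P(w | s), uniform at start
    for _ in range(iterations):
        counts = defaultdict(float)        # expected counts of (s, w) links
        totals = defaultdict(float)        # normalizers per source word
        for src, tgt in bitext:
            src = ["<NULL>"] + src         # NULL absorbs unaligned target words
            for w in tgt:
                norm = sum(t[(s, w)] for s in src)
                for s in src:              # E-step: posterior link probability
                    frac = t[(s, w)] / norm
                    counts[(s, w)] += frac
                    totals[s] += frac
        for (s, w), c in counts.items():   # M-step: renormalize expected counts
            t[(s, w)] = c / totals[s]
    return t

def decode(t, src, tgt):
    """Greedily link each target word to its most probable source word."""
    links = set()
    for j, w in enumerate(tgt):
        scores = [(t[("<NULL>", w)], -1)] + [(t[(s, w)], i) for i, s in enumerate(src)]
        prob, i = max(scores)
        if i >= 0:                         # words best explained by NULL stay unaligned
            links.add((i, j))
    return links
```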
Using our novel dataset, we selected, reparameterized, applied, and compared several machine alignment approaches. The main method extracts alignments from the contextualized word embeddings (Dou and Neubig, 2021) of deep learning language models such as BERT and LaBSE, with further fine-tuning on the parallel corpus. BERT (Bidirectional Encoder Representations from Transformers) is a language model pre-trained on unlabeled monolingual texts in different languages (Devlin et al., 2018). LaBSE (Language-agnostic BERT Sentence Embedding) combines a masked language model with a translation language model trained on parallel data (Feng et al., 2020). IBM Model 1 and Fast Align, a reparameterization of IBM Model 2 (Dyer et al., 2013), were used as statistical baselines against which the neural models were evaluated.
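A minimal sketch of this embedding-based extraction, in the spirit of Dou and Neubig (2021): encode each sentence, compute a token-similarity matrix, apply softmax in both directions, and keep links that are probable in both. The model checkpoint, layer index, and threshold below are illustrative assumptions, not the exact configuration used in the study.

```python
# A sketch of alignment extraction from contextualized embeddings, in the
# spirit of Dou and Neubig (2021). The multilingual BERT checkpoint, the
# layer index, and the probability threshold are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def extract_alignment(src_words, tgt_words, layer=8, threshold=1e-3):
    def embed(words):
        enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc, output_hidden_states=True)
        # drop [CLS]/[SEP]; keep the subword-to-word index mapping
        return out.hidden_states[layer][0, 1:-1], enc.word_ids()[1:-1]

    src_vecs, src_map = embed(src_words)
    tgt_vecs, tgt_map = embed(tgt_words)
    sim = src_vecs @ tgt_vecs.T                    # subword similarity matrix
    p_s2t = torch.softmax(sim, dim=-1)             # source-to-target direction
    p_t2s = torch.softmax(sim, dim=-2)             # target-to-source direction
    mask = (p_s2t > threshold) & (p_t2s > threshold)
    # map subword-level links back to word-level links
    return {(src_map[i], tgt_map[j]) for i, j in zip(*mask.nonzero(as_tuple=True))}
```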
In our study, we first evaluated word alignments extracted directly from publicly available versions of BERT and LaBSE. Then we additionally trained those models on the Russian-Chinese parallel corpus (~700,000 sentence pairs). As a final step, we fine-tuned the models on the gold-standard data marked up by human annotators (a further 351 sentence pairs). To compare the quality of the models, we use the AER (Alignment Error Rate) metric introduced by Och and Ney (2000). All RuZhCorp texts (3.5 million words) are used for training; when training on annotated sentences from the novel dataset, AER is calculated on the held-out remainder (a test set of 102 sentence pairs).
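For reference, given predicted links A, sure gold links S, and possible gold links P (with S ⊆ P), AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|). A minimal sketch of the computation (the function name and set-based interface are our own):

```python
# A minimal sketch of the AER metric of Och and Ney (2000); the function
# name and the set-of-index-pairs interface are our own illustration.
def alignment_error_rate(predicted, sure, possible):
    """All arguments are sets of (src_idx, tgt_idx) links; sure ⊆ possible."""
    return 1.0 - (len(predicted & sure) + len(predicted & possible)) / (
        len(predicted) + len(sure)
    )
```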
As of now, our project team has compiled a fully annotated gold-standard dataset of 453 sentence pairs, along with more than ten well-tested rules for Russian-Chinese and Chinese-Russian word alignment. When the algorithms are compared on this dataset after fine-tuning, LaBSE achieves the best AER of 18.9%, with BERT following at 28.3%. Given the lack of previous work on Russian-Chinese word alignment, we compared our results with those reported for other pairs of dissimilar languages combining one European and one East Asian language. Li et al. (2019) report 36.57% as their best result for Chinese-English; Dou and Neubig (2021) report an AER of 37.4% for Japanese-English and a much lower 13.9% for Chinese-English. Our results are thus in line with previous research on comparable language pairs; indeed, they may serve as a valuable benchmark for future research on Russian-Chinese word alignment. They also enable the development of the above-mentioned tool for language scholars.
References
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.
- Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
- Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
- Zi-Yi Dou and Graham Neubig. 2021. Word alignment by fine-tuning embeddings on parallel corpora. arXiv:2101.08231.
- Sofia P. Durneva, Yulia N. Kuznetsova, and Kirill I. Semenov. 2020. Russian-Chinese parallel corpus of RNC: Problems and perspectives. Proceedings of the 10th International Conference “Russia and China: History and Perspectives for Cooperation”, 633–640. (София П. Дурнева, Юлия Н. Кузнецова, Кирилл И. Семенов. 2020. Русско-китайский параллельный корпус НКРЯ: проблемы и перспективы. X Международная научно-практическая конференция «Россия и Китай: история и перспективы сотрудничества», 633–640.)
- Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 644–648.
- Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2020. Language-agnostic BERT sentence embedding. arXiv:2007.01852.
- João Graça, Joana Paulo Pardal, Luísa Coheur, and Diamantino Caseiro. 2008. Building a golden collection of parallel multi-language word alignments. Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08).
- Xintong Li, Guanlin Li, Lemao Liu, Max Meng, and Shuming Shi. 2019. On the word alignment from neural machine translation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1293–1303.
- Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, 440–447.
- Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
- Elias Stengel-Eskin, Tzu-Ray Su, Matt Post, and Benjamin Van Durme. 2019. A discriminative neural model for cross-lingual word alignment. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 910–920.
- Nan Yang, Shujie Liu, Mu Li, Ming Zhou, and Nenghai Yu. 2013. Word alignment modeling with context dependent deep neural network. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1), 166–175.