ALGORITHMS FOR AUTOMATIC EXTRACTION OF DOMAIN TERMS FROM BILINGUAL PARALLEL TEXTS AND IDENTIFICATION OF THEIR SEMANTIC EQUIVALENTS
Keywords:
parallel corpora, mono- and multilingual embeddings, neural approaches, bilingual terms, alignment, semantic equivalent alignment.
Abstract
This article proposes an integrated algorithmic framework for automatic term extraction (ATE) and for aligning the extracted terms with their semantic equivalents (bilingual term alignment, also known as bilingual lexicon induction) in bilingual parallel and comparable corpora. The framework combines traditional statistical and morphological methods (C-value, TF–IDF, Alban) with neural approaches (mono- and multilingual embeddings, contextual transformer models, and word alignment). The experimental section evaluates the framework in terms of precision, recall, and mean average precision (MAP) on parallel corpora and on domain-specific comparable corpora.
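As an illustration of the statistical component, the C-value measure can be sketched as follows. This is a minimal sketch and not the implementation used in the article: the function names `contains` and `c_value`, the toy frequencies, and the smoothed length weight log2(|a| + 1) (a common variant that keeps single-word candidates from scoring zero) are assumptions made for the example. Candidate terms are assumed to have been extracted and counted beforehand.

```python
import math


def contains(longer, shorter):
    """Return True if `shorter` occurs as a contiguous token sequence in `longer`."""
    m = len(shorter)
    return any(longer[i:i + m] == shorter for i in range(len(longer) - m + 1))


def c_value(freqs):
    """Score candidate terms with the C-value measure.

    freqs maps each candidate term (a tuple of tokens) to its corpus frequency.
    A candidate nested in longer candidates is penalised by the mean
    frequency of those longer candidates.
    """
    scores = {}
    for a, f_a in freqs.items():
        # Frequencies of longer candidates that contain `a` (the set T_a).
        nesting = [f_b for b, f_b in freqs.items()
                   if len(b) > len(a) and contains(b, a)]
        # Assumption: smoothed length weight log2(|a| + 1) instead of log2(|a|).
        weight = math.log2(len(a) + 1)
        if nesting:
            scores[a] = weight * (f_a - sum(nesting) / len(nesting))
        else:
            scores[a] = weight * f_a
    return scores


# Hypothetical candidate frequencies, for illustration only.
freqs = {
    ("term", "extraction"): 10,
    ("automatic", "term", "extraction"): 4,
    ("bilingual", "term", "extraction"): 2,
}
scores = c_value(freqs)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy example the nested candidate "term extraction" is discounted by the mean frequency of the two longer candidates that contain it, yet it still ranks first because it occurs far more often on its own.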