Bilingual Lexicon Induction From Comparable and Parallel Data: A Comparative Analysis
Authors | |
---|---|
Year of publication | 2024 |
Type | Article in Proceedings |
Conference | International Conference on Text, Speech, and Dialogue |
MU Faculty or unit | |
Citation | |
Web | Preprint version |
DOI | http://dx.doi.org/10.1007/978-3-031-70563-2_3 |
Keywords | bilingual lexicon induction; cross-lingual word embeddings; neural machine translation systems |
Description | Bilingual lexicon induction (BLI) from comparable data has become a common way of evaluating cross-lingual word embeddings (CWEs). These models have drawn much attention, mainly due to their availability for rare and low-resource language pairs. An alternative is offered by systems exploiting parallel data, such as popular neural machine translation systems (NMTSs), which are effective and yield state-of-the-art results. Despite the significant advancements in NMTSs, their effectiveness in the BLI task compared to models using comparable data remains underexplored. In this paper, we provide a comparative study of NMTS and CWE models evaluated on the BLI task and demonstrate the results across three diverse language pairs: a distant pair (Estonian-English), a close pair (Estonian-Finnish), and a pair with different scripts (Estonian-Russian). Our study reveals the differences, strengths, and limitations of both approaches. We show that while NMTSs achieve impressive results for languages with a large amount of training data available, CWEs emerge as a better option when fewer resources are available. |