HFT: High Frequency Tokens for Low-Resource NMT

Signoroni, Edoardo; Rychlý, Pavel

Warning

This publication does not fall under the Faculty of Arts, but under the Faculty of Informatics. The official publication page is on the muni.cz website.

Authors	SIGNORONI Edoardo; RYCHLÝ Pavel
Year of publication	2022
Type	Article in proceedings
Conference	Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)
MU Faculty / Unit	Faculty of Informatics
www	https://aclanthology.org/2022.loresmt-1.8
Keywords	Machine Translation; Tokenization
Description	Tokenization has been shown to impact the quality of downstream tasks, such as Neural Machine Translation (NMT), which is susceptible to out-of-vocabulary words and low-frequency training data. Current state-of-the-art algorithms have been helpful in addressing the issues of out-of-vocabulary words, bigger vocabulary sizes and token frequency by implementing subword segmentation. We argue, however, that there is still room for improvement, in particular regarding low-frequency tokens in the training data. In this paper, we present “High Frequency Tokenizer”, or HFT, a new language-independent subword segmentation algorithm that addresses this issue. We also propose a new metric to measure the frequency coverage of a tokenizer’s vocabulary, based on a frequency rank weighted average of the frequency values of its items. We experiment with a diverse set of language corpora, vocabulary sizes, and writing systems and report improvements on both frequency statistics and on the average length of the output. We also observe a positive impact on downstream NMT.
Related projects:	LINDAT/CLARIAH-CZ - Digital Research Infrastructure for Language Technologies, Arts and Humanities; Internal Grant Agency of Masaryk University; A New Machine Translation-based approach to Parallel Corpora Alignment
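
The description above mentions a metric based on a frequency rank weighted average of the frequency values of a vocabulary's items, but this record does not give the formula. The following is a minimal sketch of one possible reading, not the authors' definition: token frequencies are sorted in descending order, rank 1 goes to the most frequent item, and each frequency is weighted by its rank, so rarely seen items pull the score down more than they would in a plain mean. The function name `rank_weighted_avg_frequency` and the choice of weights are assumptions made for illustration only.

```python
from collections import Counter


def rank_weighted_avg_frequency(tokens):
    """Rank-weighted average of token frequencies (illustrative assumption).

    Frequencies are sorted in descending order; the item at rank r
    (1 = most frequent) contributes its frequency with weight r.
    This is a hypothetical reading of the metric, not the paper's formula.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if not freqs:
        raise ValueError("empty token sequence")
    ranks = range(1, len(freqs) + 1)
    return sum(r * f for r, f in zip(ranks, freqs)) / sum(ranks)


# Toy segmented corpus: frequencies are a=3, b=2, c=1.
score = rank_weighted_avg_frequency(["a", "a", "a", "b", "b", "c"])
print(score)
```

On this toy input the plain mean frequency is 2, while the rank-weighted score is lower because the rarest item carries the largest weight. Under this reading, a tokenizer whose vocabulary has fewer rare items would score higher.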