DSL Shared task 2016: Perfect Is The Enemy of Good Language Discrimination Through Expectation-Maximization and Chunk-based Language Model

Authors	HERMAN Ondřej; SUCHOMEL Vít; BAISA Vít; RYCHLÝ Pavel
Year of publication	2016
Type	Article in Proceedings
Conference	Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
MU Faculty or unit	Faculty of Informatics
Citation
web	https://aclanthology.info/pdf/W/W16/W16-4815.pdf
Field	Informatics
Keywords	language discrimination;expectation maximization;language model
Description	In this paper we investigate two approaches to the discrimination of similar languages: the expectation-maximization algorithm for estimating the conditional probability P(word|language), and byte-level language models similar to compression-based language modelling methods. These methods reached accuracies of 86.6 % and 88.3 %, respectively, on set A of the DSL Shared Task 2016 competition.
Related projects:	Harvesting big text data for under-resourced languages; Rozsáhlé výpočetní systémy: modely, aplikace a verifikace V. (Large Computational Systems: Models, Applications and Verification V)
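
Illustration	The description mentions byte-level language models for telling languages apart. The sketch below shows the general idea only: a generic add-one smoothed byte trigram model, with one model per language and classification by highest log-likelihood. It is not the authors' chunk-based model; the class name ByteNgramModel, the function classify, the smoothing choice and the toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


class ByteNgramModel:
    """Add-one smoothed byte n-gram model (illustrative assumption,
    not the chunk-based model described in the paper)."""

    def __init__(self, order=3):
        self.order = order
        self.ngrams = Counter()    # counts of context + next byte
        self.contexts = Counter()  # counts of the context alone

    def _padded(self, text):
        # Pad with NUL bytes so the first bytes also have a full context.
        return b"\x00" * (self.order - 1) + text.encode("utf-8")

    def train(self, text):
        data = self._padded(text)
        for i in range(self.order - 1, len(data)):
            context = data[i - self.order + 1:i]
            self.ngrams[context + data[i:i + 1]] += 1
            self.contexts[context] += 1

    def log_prob(self, text):
        data = self._padded(text)
        total = 0.0
        for i in range(self.order - 1, len(data)):
            context = data[i - self.order + 1:i]
            count = self.ngrams[context + data[i:i + 1]]
            # Add-one smoothing over the 256 possible byte values.
            total += math.log((count + 1) / (self.contexts[context] + 256))
        return total


def classify(models, text):
    """Return the label whose model assigns the text the highest log-probability."""
    return max(models, key=lambda lang: models[lang].log_prob(text))


# Toy training data (made up for this example; far too small for real use).
models = {"en": ByteNgramModel(), "cs": ByteNgramModel()}
models["en"].train("the quick brown fox jumps over the lazy dog")
models["cs"].train("příliš žluťoučký kůň úpěl ďábelské ódy")

print(classify(models, "the dog"))
print(classify(models, "žluťoučký kůň"))
```

Working on bytes rather than characters keeps the alphabet fixed at 256 symbols regardless of script, which is what makes the simple add-one denominator above well defined.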