csTenTen17, a Recent Czech Web Corpus

Authors	SUCHOMEL Vít
Year of publication	2018
Type	Article in Proceedings
Conference	Proceedings of the Twelfth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2018
MU Faculty or unit	Faculty of Informatics
Web	https://nlp.fi.muni.cz/raslan/2018/paper10-Suchomel.pdf
Keywords	Czech corpus; web corpus; text processing
Description	This article introduces csTenTen17, a very large Czech text corpus for language research, compiled from texts downloaded in 2015, 2016 and 2017. The corpus consists of 10.5 billion words, double the size of its predecessor from 2012. A brief comparison with other recent Czech corpora follows.
Related projects:	LINDAT-Clarin project - Building and operating the Czech node of the pan-European research infrastructure