Practical Web Crawling for Text Corpora

Authors

SUCHOMEL Vít, POMIKÁLEK Jan

Year of publication 2011
Type Article in Proceedings
Conference Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011
MU Faculty or unit

Faculty of Informatics

Citation
Web https://nlp.fi.muni.cz/raslan/2011/paper09.pdf
Field Informatics
Keywords crawler; web crawling; corpus; web corpus; text corpus
Description SpiderLing, a web spider for linguistics, is new software for creating text corpora from the web, which we present in this article. Many documents on the web contain only material that is not useful for text corpora, such as lists of links, lists of products, and other kinds of text not comprised of full sentences. In fact, such pages represent the vast majority of the web. As a result, unrestricted web crawls typically download a lot of data that gets filtered out during post-processing, which makes web corpus collection inefficient. The aim of our work is to focus crawling on the text-rich parts of the web and to maximize the number of words in the final corpus per downloaded megabyte. We present preliminary results from creating web corpora of Czech and Tajik texts.
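The efficiency goal stated in the description, maximizing words in the final corpus per downloaded megabyte, can be sketched as a per-domain yield-rate check: the crawler tracks how many clean words each domain produces per byte downloaded and stops visiting domains that prove text-poor. The class names, threshold, and minimum-evidence parameter below are illustrative assumptions for this sketch, not SpiderLing's actual implementation:

```python
class DomainStats:
    """Track downloaded bytes and extracted clean words for one web domain."""

    def __init__(self) -> None:
        self.bytes_downloaded = 0
        self.words_extracted = 0

    def record(self, page_bytes: int, clean_words: int) -> None:
        """Update counters after downloading and cleaning one page."""
        self.bytes_downloaded += page_bytes
        self.words_extracted += clean_words

    def yield_rate(self) -> float:
        """Words per downloaded byte: the quantity the crawl tries to maximize."""
        if self.bytes_downloaded == 0:
            return float("inf")  # unseen domains get a chance first
        return self.words_extracted / self.bytes_downloaded


# Assumed threshold (words per byte) below which a domain is abandoned.
MIN_YIELD = 0.01


def should_keep_crawling(stats: DomainStats, min_bytes: int = 100_000) -> bool:
    """Keep crawling a domain until enough evidence shows it is text-poor."""
    if stats.bytes_downloaded < min_bytes:
        return True  # not enough data yet to judge the domain
    return stats.yield_rate() >= MIN_YIELD
```

A domain serving mostly link lists and product pages yields few words per byte, so its yield rate falls below the threshold once enough pages have been sampled, and the scheduler drops it in favour of text-rich domains.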
