Do we need very large corpora?

Authors	PALA Karel, RYCHLÝ Pavel
Year of publication	2011
Type	Article in Proceedings
MU Faculty or unit	Faculty of Informatics
Field	Informatics
Keywords	corpora, corpus tools
Description	This paper deals with building very large corpora from the Web. First, we discuss the motivation and need for such resources among linguists, lexicographers, and NLP specialists. Second, we describe the techniques used for building large corpora (more than a billion tokens) and present the results obtained at the NLP Centre, FI MU, namely both tools and corpora. Finally, we analyze the consequences of building large text data resources and the ways in which they are used in corpus linguistics and various NLP applications.
Related projects	Centrum komputační lingvistiky