Scaling to Billion-plus Word Corpora
Year of publication | 2009 |
---|---|
Type | Article in Periodical |
Magazine / Source | Advances in Computational Linguistics |
Field | Informatics |
Keywords | word corpora; web as corpus; duplicate detection |
Description | Most phenomena in natural languages are distributed in accordance with Zipf's law: many words, phrases and other items occur only rarely, so very large corpora are needed to provide evidence about them. Previous work shows that it is possible to create very large (multi-billion-word) corpora from the web. The usability of such corpora is often limited by duplicate content and a lack of efficient query tools. This paper describes BiWeC, a Big Web Corpus of English texts currently comprising 5.5 billion words fully processed, with a target size of 20 billion. We present a method for detecting near-duplicate text documents in multi-billion-word text collections and describe how one corpus query tool, the Sketch Engine, has been re-engineered to efficiently encode, process and query such corpora on low-cost hardware. |
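
As a back-of-the-envelope illustration of why Zipf's law pushes corpus sizes into the billions (the exponent, vocabulary size and ranks below are illustrative assumptions, not figures from the paper): under Zipf's law with exponent $s \approx 1$, the item of frequency rank $r$ in an $N$-token corpus is expected roughly

$$f(r) \approx \frac{N}{r \, H_V}, \qquad H_V = \sum_{k=1}^{V} \frac{1}{k},$$

times, where $V$ is the vocabulary size. With $N = 10^8$ and $V = 10^7$ (so $H_V \approx 16.7$), the rank-$10^6$ item is expected only about six times, and rarer items and multi-word phrases even less often; gathering enough evidence about the long tail therefore requires corpora orders of magnitude larger.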
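
The abstract does not spell out which near-duplicate detection method is used. A standard technique for this task at web scale is shingling combined with MinHash signatures, sketched minimally below; all function names, parameters and thresholds here are illustrative assumptions, not details from the paper.

```python
import hashlib

def shingles(text, n=5):
    """Yield overlapping n-word shingles (word n-grams) of a document."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def minhash_signature(text, num_hashes=64, n=5):
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value observed over all shingles."""
    sig = []
    for seed in range(num_hashes):
        best = None
        for sh in shingles(text, n):
            digest = hashlib.md5(f"{seed}:{sh}".encode()).digest()
            value = int.from_bytes(digest[:8], "big")
            if best is None or value < best:
                best = value
        sig.append(best)
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions is an unbiased
    estimate of the Jaccard similarity of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Documents whose estimated similarity exceeds a chosen threshold
# (e.g. 0.9) would be treated as near-duplicates and deduplicated.
a = minhash_signature("the quick brown fox jumps over the lazy dog " * 4)
b = minhash_signature("the quick brown fox leaps over the lazy dog " * 4)
print(estimated_jaccard(a, b))
```

At multi-billion-word scale, the signatures would additionally be banded with locality-sensitive hashing so that candidate pairs can be found without comparing every pair of documents.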