Scaling to Billion-plus Word Corpora
Year of publication | 2009 |
---|---|
Type | Article in Periodical |
Magazine / Source | Advances in Computational Linguistics |
Field | Informatics |
Keywords | word corpora; web as corpus; duplicate detection |
Description | Most phenomena in natural languages are distributed in accordance with Zipf's law: many words, phrases and other items occur only rarely, so very large corpora are needed to provide evidence about them. Previous work shows that it is possible to create very large (multi-billion-word) corpora from the web. The usability of such corpora is often limited by duplicate content and a lack of efficient query tools. This paper describes BiWeC, a Big Web Corpus of English texts currently comprising 5.5 billion words fully processed, with a target size of 20 billion. We present a method for detecting near-duplicate text documents in multi-billion-word text collections and describe how one corpus query tool, the Sketch Engine, has been re-engineered to efficiently encode, process and query such corpora on low-cost hardware. |
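
As a back-of-the-envelope illustration of why Zipf's law pushes corpus sizes into the billions (the exponent, vocabulary size and ranks below are illustrative assumptions, not figures from the paper): under Zipf's law with exponent $s \approx 1$, the item of frequency rank $r$ in an $N$-token corpus is expected roughly

$$f(r) \approx \frac{N}{r \, H_V}, \qquad H_V = \sum_{k=1}^{V} \frac{1}{k},$$

times, where $V$ is the vocabulary size. With $N = 10^8$ and $V = 10^7$ (so $H_V \approx 16.7$), the rank-$10^6$ item is expected only about six times, and rarer items and multi-word phrases even less often; gathering enough evidence about the long tail therefore requires corpora orders of magnitude larger.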
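
The abstract does not spell out which near-duplicate detection method is used. A standard technique for this task at web scale is shingling combined with MinHash signatures, sketched minimally below; all function names, parameters and thresholds here are illustrative assumptions, not details from the paper.

```python
import hashlib

def shingles(text, n=5):
    """Yield overlapping n-word shingles (word n-grams) of a document."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def minhash_signature(text, num_hashes=64, n=5):
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value observed over all shingles."""
    sig = []
    for seed in range(num_hashes):
        best = None
        for sh in shingles(text, n):
            digest = hashlib.md5(f"{seed}:{sh}".encode()).digest()
            value = int.from_bytes(digest[:8], "big")
            if best is None or value < best:
                best = value
        sig.append(best)
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions is an unbiased
    estimate of the Jaccard similarity of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Documents whose estimated similarity exceeds a chosen threshold
# (e.g. 0.9) would be treated as near-duplicates and deduplicated.
a = minhash_signature("the quick brown fox jumps over the lazy dog " * 4)
b = minhash_signature("the quick brown fox leaps over the lazy dog " * 4)
print(estimated_jaccard(a, b))
```

At multi-billion-word scale, the signatures would additionally be banded with locality-sensitive hashing so that candidate pairs can be found without comparing every pair of documents.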