The TenTen Corpus Family

Warning

This publication does not belong to the Faculty of Arts but to the Faculty of Informatics. The official publication page is on the muni.cz website.
Authors

JAKUBÍČEK Miloš KILGARRIFF Adam KOVÁŘ Vojtěch RYCHLÝ Pavel SUCHOMEL Vít

Year of publication 2013
Type Article in Proceedings
Conference 7th International Corpus Linguistics Conference CL 2013
MU Faculty or unit

Faculty of Informatics

Description Everyone working on general language would like their corpus to be bigger, wider-coverage, cleaner, duplicate-free, and with richer metadata. In this paper we describe our programme to build ever better corpora along these lines for all of the world's major languages (plus some others). Baroni and Kilgarriff (2006), Sharoff (2006), Baroni et al. (2009), and Kilgarriff et al. (2010) present the case for web corpora and programmes in which a number of them have been developed. TenTens are a development from them: a new family of corpora of the order of 10 billion words. We describe how we are building them, what we have built so far, and how we shall continue maintaining them and keeping them up to date in the years ahead. While, as yet, they have very little metadata, we are working out how to gather and add metadata attribute by attribute. The corpora are all available for research at http://www.sketchengine.co.uk.
