Building Corpora for Stylometric Research
Autoři | |
---|---|
Rok publikování | 2016 |
Druh | Článek ve sborníku |
Konference | Text, Speech, and Dialogue - 19th International Conference |
Fakulta / Pracoviště MU | |
Citace | |
Doi | http://dx.doi.org/10.1007/978-3-319-45510-5_3 |
Obor | Informatika |
Klíčová slova | corpus; stylometry; authorship; crawler |
Popis | Authorship recognition, machine translation detection, pedophile identification and other stylometry techniques are daily used in applications for the most widely used languages. On the other hand, under-represented languages lack data sources usable for stylometry research. In this paper, we propose novel algorithm to build corpora containing meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool Authorship Corpora Builder (ACB). We modify data-cleaning techniques for purposes of stylometry field and add a heuristic layer to detect and extract valuable meta-information. The system was evaluated on Czech and Slovak web domains. Collected data have been published and we are planning to build collections for other languages and gradually extend existing ones. |