Slavonic Corpus for Stylometry Research
Autoři | |
---|---|
Rok publikování | 2015 |
Druh | Článek ve sborníku |
Konference | Proceedings of Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2015. |
Fakulta / Pracoviště MU | |
Citace | |
www | |
Obor | Informatika |
Klíčová slova | stylometry; slavonic corpus; web structure detection; corpora building |
Popis | Stylometry techniques such as authorship recognition, machine translation detection and pedophile identification are daily used in applications for the most widely used languages. But under-represented languages lack data sources usable for stylometry research. In this paper, we propose an algorithm to build corpora containing meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool Authorship Corpora Builder (ACB). We modify crawling and data-cleaning techniques for purposes of stylometry field and add heuristic layer to detect and extract meta-information. The system was used on Czech and Slovak web domains to build a Slavonic corpus for stylometry research. Collected data have been published and we are planning to build collections for other languages and gradually extend existing ones. |
Související projekty: |