Building Corpora for Stylometric Research

Warning

This publication does not fall under the Faculty of Arts, but under the Faculty of Informatics. The official page of the publication is on the muni.cz website.

Authors	ŠVEC Ján, RYGL Jan
Year of publication	2016
Type	Article in Proceedings
Conference	Text, Speech, and Dialogue - 19th International Conference
MU Faculty or unit	Faculty of Informatics
DOI	http://dx.doi.org/10.1007/978-3-319-45510-5_3
Field	Informatics
Keywords	corpus; stylometry; authorship; crawler
Description	Authorship recognition, machine translation detection, pedophile identification and other stylometry techniques are used daily in applications for the most widely used languages. Under-represented languages, on the other hand, lack data sources usable for stylometry research. In this paper, we propose a novel algorithm for building corpora that contain the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt data-cleaning techniques to the needs of stylometry and add a heuristic layer that detects and extracts valuable meta-information. The system was evaluated on Czech and Slovak web domains. The collected data have been published, and we plan to build collections for other languages and gradually extend the existing ones.