Building Corpora for Stylometric Research

Varování

Publikace nespadá pod Filozofickou fakultu, ale pod Fakultu informatiky. Oficiální stránka publikace je na webu muni.cz.
Autoři

ŠVEC Ján RYGL Jan

Rok publikování 2016
Druh Článek ve sborníku
Konference Text, Speech, and Dialogue - 19th International Conference
Fakulta / Pracoviště MU

Fakulta informatiky

Citace
Doi http://dx.doi.org/10.1007/978-3-319-45510-5_3
Obor Informatika
Klíčová slova corpus; stylometry; authorship; crawler
Popis Authorship recognition, machine translation detection, pedophile identification and other stylometry techniques are daily used in applications for the most widely used languages. On the other hand, under-represented languages lack data sources usable for stylometry research. In this paper, we propose novel algorithm to build corpora containing meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool Authorship Corpora Builder (ACB). We modify data-cleaning techniques for purposes of stylometry field and add a heuristic layer to detect and extract valuable meta-information. The system was evaluated on Czech and Slovak web domains. Collected data have been published and we are planning to build collections for other languages and gradually extend existing ones.

Používáte starou verzi internetového prohlížeče. Doporučujeme aktualizovat Váš prohlížeč na nejnovější verzi.