Separating Named Entities

Authors	ULIPOVÁ Barbora, GRÁC Marek
Year of publication	2014
Type	Article in Proceedings
Conference	Eighth Workshop on Recent Advances in Slavonic Natural Language Processing
MU Faculty or unit	Faculty of Arts
Citation
www	https://nlp.fi.muni.cz/raslan/2014/15.pdf
Field	Linguistics
Keywords	text corpus; mutual information; named entities
Description	In this paper, we analyze long sequences of mostly capitalized words that look like a single named entity but in fact consist of several named entities. An example of this phenomenon is hokejista (hockey player) New York Rangers Jaromír Jágr. Unless the sequence is split correctly, the whole capitalized sequence will wrongly be taken as the name of the hockey player. To find out how such a sequence should be split into the correct named entities, we tested several methods based on the frequencies of the words in the sequence and of their n-grams. The DIFF-2 method proposed in this article obtained much better results than MI-score or logDice.
Related projects:	Representation of the Czech Republic in the European Research Consortium for Informatics and Mathematics; Czech in the Unity of Synchrony and Diachrony - 2014
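
To make the two baseline measures named in the description more concrete, the following is a minimal sketch of how MI-score and logDice could be computed for adjacent word pairs and used to pick a split point. It uses the commonly cited formulas for both measures. The frequency counts are invented for illustration, the helper names are hypothetical, and the strategy of cutting at the weakest adjacent pair is an assumption, not necessarily the procedure used in the paper. DIFF-2 is not sketched here because this record does not define it.

```python
import math

# Toy counts, invented for illustration only (not taken from any corpus).
CORPUS_SIZE = 1_000_000
UNIGRAMS = {"New": 900, "York": 700, "Rangers": 150, "Jaromír": 300, "Jágr": 280}
BIGRAMS = {
    ("New", "York"): 650,
    ("York", "Rangers"): 90,
    ("Rangers", "Jaromír"): 2,
    ("Jaromír", "Jágr"): 260,
}


def mi_score(f_xy, f_x, f_y, n=CORPUS_SIZE):
    """MI-score: log2 of observed vs. expected co-occurrence frequency."""
    if f_xy == 0:
        return float("-inf")  # pair never seen together
    return math.log2(f_xy * n / (f_x * f_y))


def log_dice(f_xy, f_x, f_y):
    """logDice: 14 + log2 of the Dice coefficient; independent of corpus size."""
    if f_xy == 0:
        return float("-inf")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def split_at_weakest_link(tokens, score):
    """Assumed strategy: cut the sequence between the least associated neighbours."""
    links = [
        score(BIGRAMS.get((a, b), 0), UNIGRAMS[a], UNIGRAMS[b])
        for a, b in zip(tokens, tokens[1:])
    ]
    cut = links.index(min(links)) + 1
    return tokens[:cut], tokens[cut:]


tokens = ["New", "York", "Rangers", "Jaromír", "Jágr"]
for name, score in (("MI-score", mi_score), ("logDice", log_dice)):
    left, right = split_at_weakest_link(tokens, score)
    print(name, "|", " ".join(left), "|", " ".join(right))
```

With these made-up counts both measures cut between Rangers and Jaromír. On real corpus frequencies the two measures can disagree, which is the kind of difference the evaluation in the paper is concerned with.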