Words’ Burstiness in Language Models
Authors | |
---|---|
Year of publication | 2011 |
Type | Article in proceedings |
Conference | Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011 |
MU Faculty / Unit | |
Citation | |
www | https://nlp.fi.muni.cz/raslan/2011/paper17.pdf |
Field | Linguistics |
Keywords | Burstiness; Language models; Words' probability |
Description | Good estimation of the probability of a single word is a crucial part of language modelling. The estimate is based on the raw frequency of the word in a training corpus. Such a computation gives a good estimate for function words and most very frequent words, but a poor one for most content words, because words tend to occur in clusters. This paper provides an analysis of words' burstiness and proposes a new unigram language model that handles bursty words much better. Evaluation on two data sets shows consistently lower perplexity and cross-entropy for the new model. |
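
To make the terms in the description concrete, the following is a minimal sketch of the raw-frequency baseline and of the two evaluation measures it mentions. It does not reproduce the paper's proposed burstiness-aware model. The function names `train_unigram`, `cross_entropy` and `perplexity`, the add-alpha smoothing, and the single reserved slot for unseen words are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_unigram(tokens, alpha=1.0):
    """Return a function word -> probability estimated from raw frequency.

    Assumption: add-alpha smoothing with one extra vocabulary slot for
    unseen words, so that unseen test words get non-zero probability.
    """
    counts = Counter(tokens)
    total = len(tokens)
    vocab_size = len(counts) + 1  # +1 is the slot shared by unseen words

    def prob(word):
        # Counter returns 0 for words that never occurred in training.
        return (counts[word] + alpha) / (total + alpha * vocab_size)

    return prob


def cross_entropy(prob, tokens):
    """Average negative log2 probability per token, in bits."""
    return -sum(math.log2(prob(w)) for w in tokens) / len(tokens)


def perplexity(prob, tokens):
    """Perplexity is 2 raised to the cross-entropy (in bits)."""
    return 2 ** cross_entropy(prob, tokens)


# Toy usage with made-up tokens (not data from the paper).
prob = train_unigram(["a", "b", "a", "c"])
test_tokens = ["b", "b"]
print(cross_entropy(prob, test_tokens), perplexity(prob, test_tokens))
```

A model of this kind gives a word the same probability at every position, whether or not the word has already appeared nearby, which is why it fits clustered content words poorly.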