LEMPAS: A Make-Do Lemmatizer for the Swedish PAROLE-Corpus
Authors | |
---|---|
Year of publication | 2006 |
Type | Article in Periodical |
Magazine / Source | Prague Bulletin of Mathematical Linguistics |
MU Faculty or unit | |
Citation | |
Field | Informatics |
Keywords | LEMPAS; PAROLE; Swedish; lemmatizer; rule-based |
Description | LEMPAS, a lemmatizer for the Swedish PAROLE corpus, came into existence as a by-product of running the Sketch Engine (Kilgarriff et al.) on Swedish, since many of the Sketch Engine's desirable features, such as building word sketches, are available only for lemmatized corpora. We had no access to any Swedish lexical sources, and the time available for lemmatization was very limited, so the lemmatizer had no great design ambitions. Initially, we only attempted to bring related forms together under a pre-lemma, using general rules and avoiding explicit lists where possible. When the initial rules gave surprisingly good lemmatizations of nouns, verbs and adjectives, we decided to transform the pre-lemmas into real lemmas. The improved lemmatizer made a very good impression. We have tested the program on the manually lemmatized Stockholm-Umeå Corpus (SUC) and analyzed the results. |
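
The idea of bringing related forms together under a pre-lemma with general rules, rather than explicit word lists, might be sketched roughly as follows. This is only an illustration under stated assumptions: the suffix list, the minimum stem length, and the names `SUFFIXES`, `MIN_STEM`, `pre_lemma` and `group_forms` are all hypothetical and are not taken from LEMPAS, whose actual rules are not given in this record.

```python
from collections import defaultdict

# Hypothetical rule set: a few common Swedish noun endings, longest first.
# These are illustrative assumptions, NOT the rules used by LEMPAS.
SUFFIXES = ["arna", "erna", "orna", "ar", "er", "or", "en", "et", "na", "n"]

# Assumed guard: never strip a suffix if fewer than 3 characters would remain.
MIN_STEM = 3


def pre_lemma(form: str) -> str:
    """Map a word form to a pre-lemma by stripping the first matching suffix."""
    word = form.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word  # no rule applied; the form is its own pre-lemma


def group_forms(forms):
    """Group lowercased word forms that share the same pre-lemma."""
    groups = defaultdict(set)
    for form in forms:
        groups[pre_lemma(form)].add(form.lower())
    return dict(groups)


if __name__ == "__main__":
    sample = ["bil", "bilen", "bilar", "bilarna", "hus", "huset"]
    for key, members in sorted(group_forms(sample).items()):
        print(key, sorted(members))
```

A pre-lemma produced this way is only a grouping key, not necessarily a dictionary form: with this toy rule set, for instance, "flickor" and "flicka" would not end up in the same group. Turning pre-lemmas into real lemmas, as the description mentions, would need a further step that this sketch does not attempt.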