Finding Terms in Corpora for Many Languages with the Sketch Engine
Authors | |
---|---|
Year of publication | 2014 |
Type | Article in Proceedings |
Conference | Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics |
MU Faculty or unit | |
Citation | |
Web | Full text of the result |
Field | Informatics |
Keywords | terminology; terms; corpora; sketch engine |
Description | Term candidates for a domain, in a language, can be found by • taking a corpus for the domain, and a reference corpus for the language • identifying the grammatical shape of a term in the language • tokenising, lemmatising and POS-tagging both corpora • identifying (and counting) the items in each corpus which match the grammatical shape • for each item in the domain corpus, comparing its frequency with its frequency in the reference corpus. Then, the items with the highest frequency in the domain corpus in comparison to the reference corpus will be the top term candidates. None of the steps above are unusual or innovative for NLP (see, e.g., Aker et al., 2013; Gojun et al., 2012). However, it is far from trivial to implement them all, for numerous languages, in an environment that makes it easy for non-programmers to find the terms in a domain. This is what we have done in the Sketch Engine (Kilgarriff et al., 2004), and will demonstrate. In this abstract we describe how we addressed each of the stages above. |
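The steps listed in the Description field can be illustrated with a minimal Python sketch. It is not the Sketch Engine implementation. It assumes both corpora are already tokenised, lemmatised and POS-tagged by some external tool, and that each tag has been reduced to a one-letter code (`A` for adjective, `N` for noun, and so on). The term shape `A*N+`, the function names `count_candidates` and `rank_terms`, the `max_len` limit and the `smoothing` constant are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Hypothetical term shape: zero or more adjectives followed by one or more
# nouns, written as a regex over a string of one-letter POS codes.
TERM_SHAPE = re.compile(r"A*N+")


def count_candidates(tagged_tokens, shape=TERM_SHAPE, max_len=4):
    """Count lemma sequences whose POS sequence matches the term shape.

    tagged_tokens: list of (lemma, pos_code) pairs, where pos_code is a
    single character (an assumption of this sketch).
    """
    counts = Counter()
    n = len(tagged_tokens)
    for start in range(n):
        for end in range(start + 1, min(start + max_len, n) + 1):
            window = tagged_tokens[start:end]
            tags = "".join(pos for _, pos in window)
            if shape.fullmatch(tags):
                counts[" ".join(lemma for lemma, _ in window)] += 1
    return counts


def rank_terms(domain_tokens, reference_tokens, smoothing=1.0):
    """Rank domain items by smoothed relative frequency against a reference.

    Frequencies are normalised per million tokens so corpora of different
    sizes can be compared; `smoothing` avoids division by zero for items
    that never occur in the reference corpus.
    """
    domain_counts = count_candidates(domain_tokens)
    reference_counts = count_candidates(reference_tokens)
    domain_size = max(len(domain_tokens), 1)
    reference_size = max(len(reference_tokens), 1)

    scores = {}
    for item, freq in domain_counts.items():
        domain_fpm = freq * 1_000_000 / domain_size
        reference_fpm = reference_counts.get(item, 0) * 1_000_000 / reference_size
        scores[item] = (domain_fpm + smoothing) / (reference_fpm + smoothing)
    # Highest score first: these are the top term candidates.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    domain = [("neural", "A"), ("network", "N"), ("be", "V"),
              ("a", "D"), ("model", "N")]
    reference = [("the", "D"), ("model", "N"), ("be", "V"), ("good", "A")]
    for item, score in rank_terms(domain, reference):
        print(f"{item}\t{score:.2f}")
```

In this sketch nested matches are counted separately, so "network" is counted inside "neural network" as well as on its own. Whether to keep, merge or discount such nested candidates is a design decision the sketch leaves open.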