Effects of Selected Basic Algorithm Parameters and Data Features on Text Categorization by Support Vector Machines
Authors | |
---|---|
Year of publication | 2005 |
Type | Article in Proceedings |
Conference | Znalosti 2005, sborník příspěvků |
MU Faculty or unit | |
Citation | |
Field | Informatics |
Keywords | text categorization; support vector machines |
Description | This paper describes results from testing the influence of selected important parameters of Support Vector Machines (SVM) applied to text categorization. The main objective was to verify whether results obtained with standard, publicly accessible datasets (the traditional Reuters text documents and 20Newsgroups) could be applied to real medical text documents from various Internet resources used by physicians. The research also focused on data features such as document similarity, balance of categories, presence of common words (stop-words), and data volume. The experiments demonstrated that typical problems can arise when setting up parameters for some real data. The medical documents in particular gave worse results because the real-data categories were not well balanced and the documents in different categories were rather similar to one another, i.e., the classes overlapped. As a result, SVM could not always find sufficiently good separating hyperplanes, as it mostly did for 'trouble-free' datasets like Reuters or 20Newsgroups. |