Effects of Selected Basic Algorithm Parameters and Data Features on Text Categorization by Support Vector Machines

Authors

HUDÍK Tomáš, ŽIŽKA Jan

Year of publication 2005
Type Article in Proceedings
Conference Znalosti 2005, sborník příspěvků
MU Faculty or unit

Faculty of Informatics

Field Informatics
Keywords text categorization; support vector machines
Description This paper reports the results of testing how selected important parameters of Support Vector Machines (SVM) influence text categorization. The main objective was to verify whether results obtained with standard, publicly accessible datasets (the traditional Reuters text documents and 20Newsgroups) also hold for real medical text documents from various Internet resources used by physicians. The research also examined data features such as document similarity, balance of categories, presence of common words (stop-words), and data volume. The experiments demonstrated that typical problems can arise when setting up parameters for some real data. The medical documents in particular produced worse outcomes because the real-data categories were not well balanced and the documents in different categories were rather similar to one another, i.e., the classes overlapped. As a result, SVM could not always find sufficiently good separating hyperplanes, as it mostly did for 'trouble-free' datasets like Reuters or 20Newsgroups.
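The data features named in the description (category balance, similarity between categories, and stop-words) can be measured before any classifier is trained. The following is a minimal sketch under stated assumptions, not the authors' procedure: the stop-word list, the helper names (tokenize, centroid, cosine, imbalance_ratio), and the two toy categories are all hypothetical and invented purely for illustration.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to", "for"}


def tokenize(text, remove_stop_words=True):
    """Split on whitespace, strip punctuation, lowercase, optionally drop stop-words."""
    tokens = [t.strip(".,;:!?()").lower() for t in text.split()]
    return [t for t in tokens if t and not (remove_stop_words and t in STOP_WORDS)]


def centroid(docs, remove_stop_words=True):
    """Sum the term counts over all documents of one category."""
    total = Counter()
    for doc in docs:
        total.update(tokenize(doc, remove_stop_words))
    return total


def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def imbalance_ratio(categories):
    """Size of the largest category divided by the size of the smallest."""
    sizes = [len(docs) for docs in categories.values()]
    return max(sizes) / min(sizes)


# Invented toy documents; they are NOT taken from the paper's datasets.
categories = {
    "cardiology": [
        "The heart valve surgery improved blood flow.",
        "Blood pressure and heart rate in the patient.",
        "Treatment of heart failure.",
    ],
    "oncology": [
        "Treatment of the tumor in the patient.",
    ],
}

ratio = imbalance_ratio(categories)
sim_with_stop = cosine(
    centroid(categories["cardiology"], remove_stop_words=False),
    centroid(categories["oncology"], remove_stop_words=False),
)
sim_without_stop = cosine(
    centroid(categories["cardiology"]),
    centroid(categories["oncology"]),
)

print(f"imbalance ratio: {ratio:.1f}")
print(f"centroid similarity with stop-words:    {sim_with_stop:.3f}")
print(f"centroid similarity without stop-words: {sim_without_stop:.3f}")
```

On this toy input, keeping the stop-words makes the two category centroids look more alike, because shared function words inflate the overlap. A high imbalance ratio or a high centroid similarity is only a rough warning sign that a separating hyperplane may be hard to find; it does not replace an actual evaluation of the classifier.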