Effects of Selected Basic Algorithm Parameters and Data Features on Text Categorization by Support Vector Machines

Authors	HUDÍK Tomáš, ŽIŽKA Jan
Year of publication	2005
Type	Article in Proceedings
Conference	Znalosti 2005, sborník příspěvků
MU Faculty or unit	Faculty of Informatics
Field	Informatics
Keywords	text categorization; support vector machines
Description	This paper describes results obtained by testing the influence of selected important parameters of Support Vector Machines (SVM) applied to text categorization. The main objective was to verify whether results obtained with standard, publicly accessible datasets (the traditional Reuters text documents and 20Newsgroups) could be applied to real medical text documents from various Internet resources used by physicians. The research also focused on features such as document similarity, balance of categories, presence of common words (stop-words), and data volume. The experiments demonstrated that typical problems can arise when setting up parameters for some real data. The medical documents in particular produced worse outcomes because the real-data categories were not well balanced and the documents in different categories were rather similar to one another, i.e., the classes overlapped. As a result, SVM could not always find sufficiently good separating hyperplanes, as it mostly did for 'trouble-free' datasets like Reuters or 20Newsgroups.
Related projects	Human-computer interaction, dialog systems and assistive technologies
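
The Description names category balance, mutual similarity of documents in different categories, and the presence of stop-words as data features that affected the results. The following is a minimal sketch, not the authors' method, of how such features might be measured before training a classifier. The stop-word list, the function names, and the toy documents are all hypothetical and chosen only for illustration; the paper's actual data and preprocessing are not given in this record.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list; the list used in the paper is not given.
STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to"}


def tokenize(text, remove_stop_words=True):
    """Lowercase, split on whitespace, and optionally drop stop-words."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def category_centroids(docs, remove_stop_words=True):
    """Sum raw term counts per category; docs is a list of (label, text)."""
    centroids = {}
    for label, text in docs:
        tokens = tokenize(text, remove_stop_words)
        centroids.setdefault(label, Counter()).update(tokens)
    return centroids


def cosine(a, b):
    """Cosine similarity of two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def balance_ratio(docs):
    """Smallest category size divided by the largest; 1.0 means balanced."""
    counts = Counter(label for label, _ in docs)
    return min(counts.values()) / max(counts.values())


# Toy documents invented for this sketch (not taken from the paper).
docs = [
    ("cardiology", "the heart and the treatment of the patient"),
    ("cardiology", "treatment of heart disease in the patient"),
    ("cardiology", "heart surgery"),
    ("oncology", "treatment of the tumor in the patient"),
]

with_stop = category_centroids(docs, remove_stop_words=False)
without_stop = category_centroids(docs, remove_stop_words=True)

sim_with = cosine(with_stop["cardiology"], with_stop["oncology"])
sim_without = cosine(without_stop["cardiology"], without_stop["oncology"])

print("balance ratio:", balance_ratio(docs))
print("centroid similarity with stop-words:", sim_with)
print("centroid similarity without stop-words:", sim_without)
```

A low balance ratio together with a high centroid similarity would suggest the kind of unbalanced, overlapping categories that the Description reports for the medical documents. In this toy data, shared stop-words inflate the similarity between the two categories, which is one reason stop-word handling can matter.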