Filtering Very Similar Text Documents: A Case Study

Authors	HROZA Jiří; ŽIŽKA Jan; BOUREK Aleš
Year of publication	2004
Type	Article in Proceedings
Conference	Computational Linguistics and Intelligent Text Processing
MU Faculty or unit	Faculty of Informatics
Field	Informatics
Keywords	machine learning; text categorization; text filtration; text similarity
Description	This paper describes problems with the classification and filtering of similar relevant and irrelevant real medical documents from one very specific domain, obtained from Internet resources. Besides being similar, the document sets are often unbalanced: irrelevant documents are scarce in the training data. A definition of similarity is suggested. Six classification algorithms are tested from the document-similarity point of view. The best results are provided by a backpropagation-based neural network and by a support vector machine with a radial basis function kernel.
Related projects	Human-computer interaction, dialog systems and assistive technologies