Filtering Very Similar Text Documents: A Case Study

Authors

HROZA Jiří, ŽIŽKA Jan, BOUREK Aleš

Year of publication 2004
Type Article in Proceedings
Conference Computational Linguistics and Intelligent Text Processing
MU Faculty or unit

Faculty of Informatics

Field Informatics
Keywords machine learning; text categorization; text filtration; text similarity
Description This paper describes problems with the classification and filtering of similar relevant and irrelevant real medical documents from one very specific domain, obtained from Internet resources. Besides being very similar, the document sets are often unbalanced: there is a lack of irrelevant documents for training. A definition of similarity is suggested. For the classification, six algorithms are tested from the document similarity point of view. The best results are provided by the back-propagation-based neural network and by the support vector machine with a radial basis function kernel.
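
The similarity definition suggested in the paper is not reproduced in this record. As a rough illustration of why very similar relevant and irrelevant documents are hard to separate, the sketch below uses cosine similarity over word counts. This is a common measure chosen here as an assumption, not the authors' definition, and the two example texts are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical texts (not from the paper's data) that differ in one word.
relevant = bag_of_words("patient treated with drug a showed improvement")
irrelevant = bag_of_words("patient treated with drug b showed improvement")
print(cosine_similarity(relevant, irrelevant))
```

Two documents that differ in a single term score close to 1.0 under this measure, which suggests why a classifier trained on few irrelevant examples may struggle to tell such documents apart.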