Detecting Co-Derivative Documents in Large Text Collections
Authors | |
---|---|
Year of publication | 2008 |
Type | Article in Proceedings |
Conference | Proceedings of the Sixth International Language Resources and Evaluation (LREC'08) |
MU Faculty or unit | |
Citation | |
Web | http://www.lrec-conf.org/lrec2008/ |
Field | Informatics |
Keywords | Detecting; Large Text Collections |
Description | We have analyzed the SPEX algorithm by Bernstein and Zobel (2004) for detecting co-derivative documents using duplicate n-grams. Although we totally agree with the claim that not using unique n-grams can greatly increase the efficiency and scalability of the process of detecting co-derivative documents, we have found serious bottlenecks in the way SPEX finds the duplicate n-grams. While the memory requirements for computing co-derivative documents can be reduced to up to 1% by only using duplicate n-grams, SPEX needs about 40 times more memory for computing the list of duplicate n-grams itself. Therefore the memory requirements of the whole process are not reduced enough to make the algorithm practical for very large collections. We propose a solution for this problem using an external sort with the suffix array in-memory sorting and temporary file compression. The proposed algorithm for computing duplicate n-grams uses a fixed amount of memory for any input size. |
Related projects: |