Blooming Onion: Efficient Deduplication through Approximate Membership Testing

Authors

HERMAN Ondřej

Year of publication 2022
Type Article in Proceedings
Conference Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022
MU Faculty or unit

Faculty of Informatics

Keywords deduplication; text corpora; Bloom filter
Description Deduplication of source text is an important step in corpus building. Maximum corpus sizes have grown significantly, and so have the computing resources required to process them. This article explores reducing the cost of deduplication by applying approximate membership testing with Bloom filters.
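
The idea of approximate membership testing for deduplication can be illustrated with a minimal sketch. This is not the implementation from the article: the class and function names, the paragraph-level granularity, the whitespace and case normalization, and the choice of BLAKE2b with double hashing are all assumptions made for illustration. A Bloom filter never reports a seen item as new, but it can occasionally report a new item as seen, so a small fraction of unique text may be dropped in exchange for a fixed, compact memory footprint.

```python
import hashlib
import math


class BloomFilter:
    """A plain Bloom filter over byte strings (illustrative sketch only)."""

    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.blake2b(item, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # keep the step non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_new(self, item: bytes) -> bool:
        """Insert item; return True only if it was definitely not present before."""
        is_new = False
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        return is_new


def deduplicate(paragraphs, capacity=1_000_000, error_rate=0.001):
    """Yield each paragraph the first time its normalized form is seen."""
    seen = BloomFilter(capacity, error_rate)
    for text in paragraphs:
        # Assumed normalization: collapse whitespace and lowercase before hashing.
        key = " ".join(text.split()).lower().encode("utf-8")
        if seen.add_if_new(key):
            yield text
```

For example, `list(deduplicate(["a b", "A  b", "c"]))` keeps the first and third items, because the second normalizes to the same key as the first. The `capacity` and `error_rate` parameters trade memory for the rate at which unique paragraphs are wrongly discarded.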