CRANBERRY: Memory-Effective Search in 100M High-Dimensional CLIP Vectors
Authors | |
---|---|
Year of publication | 2023 |
Type | Article in Proceedings |
Conference | 16th International Conference on Similarity Search and Applications (SISAP) |
MU Faculty or unit | |
Citation | |
web | https://link.springer.com/chapter/10.1007/978-3-031-46994-7_26 |
Doi | http://dx.doi.org/10.1007/978-3-031-46994-7_26 |
Keywords | approximate similarity searching;high-dimensional data;indexing;filtering;LAION dataset |
Description | Recent advances in cross-modal multimedia data analysis require efficient similarity search at the scale of hundreds of millions of high-dimensional vectors. We address this task by proposing the CRANBERRY algorithm, which combines and tunes several existing similarity search strategies. In particular, the algorithm: (1) employs Voronoi partitioning to obtain a query-relevant candidate set in constant time, (2) applies filtering techniques to prune the obtained candidates significantly, and (3) re-ranks the retained candidate vectors with respect to the query vector. Applied to a dataset of 100 million 768-dimensional vectors, the algorithm evaluates 10NN queries with 90% recall and an average query latency of 1.2 s, at a throughput of 15 queries per second on a server with a 56-core CPU and 4.7 queries per second on a PC. |
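
The three steps named in the description (partition, filter, re-rank) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the parameters `n_cells`, `keep`, and `m`, and in particular the filter (a Euclidean distance over the first `m` coordinates, which is a cheap lower bound on the full distance) are assumptions chosen for illustration only. The filtering techniques actually used by CRANBERRY are not specified in this record.

```python
import heapq
import math
import random


def dist(a, b):
    """Full Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prefix_dist(a, b, m):
    """Euclidean distance over the first m coordinates only.

    Illustrative stand-in for a cheap filter: it never exceeds dist(a, b).
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a[:m], b[:m])))


def build_index(data, pivots):
    """Assign each vector id to the Voronoi cell of its nearest pivot."""
    cells = [[] for _ in pivots]
    for i, v in enumerate(data):
        nearest = min(range(len(pivots)), key=lambda p: dist(v, pivots[p]))
        cells[nearest].append(i)
    return cells


def search(query, data, pivots, cells, k=10, n_cells=3, keep=0.3, m=8):
    """Approximate kNN: (1) candidate cells, (2) cheap filter, (3) re-rank."""
    # (1) Candidate set: ids stored in the cells of the n_cells closest pivots.
    order = sorted(range(len(pivots)), key=lambda p: dist(query, pivots[p]))
    candidates = [i for p in order[:n_cells] for i in cells[p]]

    # (2) Filtering: retain the fraction `keep` with the smallest cheap distance
    # (hypothetical filter, see the note above), but never fewer than k.
    n_keep = max(k, int(len(candidates) * keep))
    retained = heapq.nsmallest(
        n_keep, candidates, key=lambda i: prefix_dist(query, data[i], m)
    )

    # (3) Re-ranking: order the retained ids by the full distance to the query.
    return heapq.nsmallest(k, retained, key=lambda i: dist(query, data[i]))


if __name__ == "__main__":
    random.seed(0)
    data = [[random.gauss(0, 1) for _ in range(16)] for _ in range(300)]
    pivots = random.sample(data, 10)
    cells = build_index(data, pivots)
    query = [random.gauss(0, 1) for _ in range(16)]
    print(search(query, data, pivots, cells))
```

In this sketch, visiting more cells (`n_cells`) or retaining a larger fraction (`keep`) trades query time for recall; setting `n_cells` to the number of pivots and `keep` to `1.0` degenerates to an exact scan.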