Computer Science – Databases
Scientific paper
2011-04-16
Proceedings of the VLDB Endowment (PVLDB), Vol. 4, No. 6, pp. 338-349 (2011)
Computer Science
Databases
VLDB2011
Scientific paper
Similarity joins are important operations with a broad range of applications. In this paper, we study the problem of vector similarity join size estimation (VSJ). It is a generalization of the previously studied set similarity join size estimation (SSJ) problem and can handle more interesting cases such as TF-IDF vectors. One of the key challenges in similarity join size estimation is that the join size can change dramatically depending on the input similarity threshold. We propose a sampling based algorithm that uses the Locality-Sensitive-Hashing (LSH) scheme. The proposed algorithm LSH-SS uses an LSH index to enable effective sampling even at high thresholds. We compare the proposed technique with random sampling and the state-of-the-art technique for SSJ (adapted to VSJ) and demonstrate LSH-SS offers more accurate estimates at both high and low similarity thresholds and small variance using real-world data sets.
Lee Hongrae
Ng Raymond T.
Shim Kyuseok
No associations
LandOfFree
Similarity Join Size Estimation using Locality Sensitive Hashing does not yet have a rating. At this time, there are no reviews or comments for this scientific paper.
If you have personal experience with Similarity Join Size Estimation using Locality Sensitive Hashing, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Similarity Join Size Estimation using Locality Sensitive Hashing will most certainly appreciate the feedback.
Profile ID: LFWR-SCP-O-346857