Locality sensitive hashing (LSH) is the basis of many algorithms that use a probabilistic approach to finding nearest neighbors. It is especially important for web-scale multimedia collections because of their large size and dimensionality. We describe an algorithm for optimizing the parameters and use of LSH. Prior work ignores the issue or suggests a search for the best parameters. We start with histograms that characterize the distribution of distances to a point's nearest neighbors and the distance between a query and any point in the dataset. Given a desired performance level (the chance of finding the true nearest neighbor) and a simple computational cost model we return the LSH parameters that meet the performance goal and have the minimum computational cost. We can also use this analysis to connect LSH to deterministic nearest-neighbor algorithms such as k-d trees and thus start to unify the two approaches.
Back to Large Scale Multimedia Search