We describe an approach that generalizes the concept of text-based search to visual information. We show that, by specifying the query as an image, images containing specific objects can be retrieved from videos and image collections with the ease, speed and accuracy with which Google retrieves web pages containing particular words.
In the visual case we face a number of problems beyond those of text retrieval: the object in a target image may appear at a different viewpoint, under different illumination, or at a different scale from the object specified in the query image. Nevertheless, by developing a visual analogue of a word, methods from text retrieval, such as posting lists, can be applied for efficient real-time retrieval.
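To make the text-retrieval analogy concrete, here is a minimal sketch of an inverted index over visual words with tf-idf scoring. It is an illustrative toy rather than the system described in the talk; the names (VisualIndex, add_image, query) are hypothetical.

```python
from collections import defaultdict
import math

class VisualIndex:
    """Toy inverted index: each visual word has a posting list of the
    images that contain it, exactly as a text search engine stores a
    posting list of documents per word.  (Hypothetical sketch, not the
    talk's implementation.)"""

    def __init__(self):
        self.postings = defaultdict(list)  # visual word -> [(image_id, term count)]
        self.doc_len = {}                  # image_id -> total descriptor count

    def add_image(self, image_id, visual_words):
        counts = defaultdict(int)
        for w in visual_words:
            counts[w] += 1
        for w, tf in counts.items():
            self.postings[w].append((image_id, tf))
        self.doc_len[image_id] = len(visual_words)

    def query(self, visual_words, top_k=10):
        # Score with tf-idf; only the posting lists of the query's words
        # are scanned, which is what makes retrieval real-time at scale.
        n_docs = len(self.doc_len)
        scores = defaultdict(float)
        for w in set(visual_words):
            plist = self.postings.get(w, [])
            if not plist:
                continue
            idf = math.log(n_docs / len(plist))
            for image_id, tf in plist:
                scores[image_id] += (tf / self.doc_len[image_id]) * idf
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
```

Query time thus scales with the lengths of the touched posting lists rather than with the size of the collection.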
We will describe a scalable method for building the visual vocabulary using a quantization based on randomized trees. We show, using an extensive ground-truth dataset, that re-ranking the retrieved images by their spatial consistency with the query image consistently improves search quality.
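As a rough illustration of tree-based quantization, the sketch below assigns each descriptor to a visual word by routing it down a single tree with randomly chosen split dimensions and median thresholds. This is a deliberately simplified stand-in: a practical vocabulary would use a forest of such trees (or approximate k-means) over many millions of descriptors.

```python
import numpy as np

class RandomizedTreeQuantizer:
    """Toy single-tree quantizer: each leaf of a depth-d random tree is a
    visual word.  A simplified stand-in for the scalable randomized-tree
    vocabulary construction mentioned in the abstract."""

    def __init__(self, depth=8, seed=0):
        self.depth = depth
        self.rng = np.random.default_rng(seed)
        n_internal = 2 ** depth - 1            # heap-indexed internal nodes
        self.dims = np.zeros(n_internal, dtype=int)
        self.thr = np.zeros(n_internal)

    def fit(self, X):
        # At each node, pick a random dimension and split the descriptors
        # that reach the node at their median value on that dimension.
        def build(node, idx, level):
            if level == self.depth:
                return
            d = int(self.rng.integers(X.shape[1]))
            t = float(np.median(X[idx, d])) if len(idx) else 0.0
            self.dims[node], self.thr[node] = d, t
            build(2 * node + 1, idx[X[idx, d] <= t], level + 1)
            build(2 * node + 2, idx[X[idx, d] > t], level + 1)
        build(0, np.arange(len(X)), 0)
        return self

    def quantize(self, X):
        # Route every descriptor to a leaf; the leaf index is its visual word.
        node = np.zeros(len(X), dtype=int)
        for _ in range(self.depth):
            d, t = self.dims[node], self.thr[node]
            node = 2 * node + 1 + (X[np.arange(len(X)), d] > t)
        return node - (2 ** self.depth - 1)    # leaf ids in [0, 2**depth)
```

And a toy version of the spatial re-ranking step: score a retrieved image by how many of its feature correspondences agree on a single similarity transform (rotation, scale and translation), estimated RANSAC-style. The function name and tolerance are made up for illustration.

```python
def spatial_consistency_score(pts_q, pts_t, iters=200, tol=5.0, seed=0):
    """Count correspondences consistent with one similarity transform
    between query and target feature positions ((n, 2) arrays)."""
    if len(pts_q) < 2:
        return 0
    q = pts_q[:, 0] + 1j * pts_q[:, 1]     # 2-D points as complex numbers
    t = pts_t[:, 0] + 1j * pts_t[:, 1]
    rng = np.random.default_rng(seed)
    best = 0
    for _ in range(iters):
        i, j = rng.choice(len(q), 2, replace=False)
        if q[i] == q[j]:
            continue
        a = (t[j] - t[i]) / (q[j] - q[i])  # rotation + scale
        b = t[i] - a * q[i]                # translation
        best = max(best, int(np.sum(np.abs(a * q + b - t) < tol)))
    return best
```

Images whose word matches yield many such inliers are promoted, which filters out images that share visual words with the query only by accident.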
We also show that query expansion can be successfully ported from text to the visual domain, and can substantially improve recall.
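A minimal sketch of the idea, assuming the toy index above: pool the visual words of spatially verified top results with the original query and re-issue the search. The talk's actual method is richer (for example, averaging the verified query regions), so treat the names and parameters here as illustrative.

```python
def query_with_expansion(index, query_words, words_of, verify, top_k=10):
    """Toy query expansion: run the query, keep results passing the
    caller-supplied spatial check `verify(image_id) -> bool` (e.g. enough
    RANSAC inliers), pool their visual words with the original query,
    and search again.  Illustrative names, not the talk's API."""
    first_pass = index.query(query_words, top_k)
    pooled = list(query_words)
    for image_id, _score in first_pass:
        if verify(image_id):
            pooled.extend(words_of[image_id])
    return index.query(pooled, top_k)
```

Verified true positives contribute visual words that the original query image lacked, for instance from other viewpoints of the same object, which is where the extra recall comes from.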
We will demonstrate these methods on retrieval from various films and on a dataset of over 1 million images crawled from the photo-sharing site Flickr.
Joint work with Ondrej Chum, Michael Isard, James Philbin and Josef Sivic.