Looking up photos in the cloud can be cumbersome and daunting, especially if you have forgotten the filename. To simplify the experience, file hosting platform Dropbox recently launched a new feature that makes image search seamless.
Dropbox currently serves close to 700 million registered users, including 15.48 million paying users, and generated close to $1.91 billion in revenue in 2020. In the coming months, the company is also looking to launch a search feature for video content.
Image content search
Typically, when a user searches for images on a cloud storage platform, they scroll endlessly through a pile of photos until they spot the right one, or try their luck at guessing the filename. Dropbox's new 'image search' feature instead suggests relevant images and calls out the best match based on a few descriptive words.
For instance, if you are looking for photos from a picnic, you can type in the keyword 'picnic' or the names of other objects in the image that you can vaguely recall. The relevant images are shown in no time.
Example of Image content search results for ‘picnic’ (Source: Dropbox)
But, how does this work?
Dropbox leverages machine learning techniques to improve the image content search. In this article, we will discuss the model in detail and explain how Dropbox implemented this latest feature on its existing search infrastructure.
Before that, let’s take a look at a simple image search problem:
For any image search problem, we need a relevance function that takes a (text) query 'q' and an image 'j' and returns a relevance score 's' indicating how well the image matches the search query.
s = f(q, j) … (1)
When a user searches for images, this function is run over all of their pictures, and the images whose scores exceed a threshold are returned. In the case of Dropbox, the function is built using two ML techniques: accurate image classification and word vectors.
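The search flow described above can be sketched in a few lines of Python. The relevance function `f` here is a hypothetical stand-in (a simple tag lookup), not Dropbox's actual scorer; the point is only the shape of equation (1): score every image, filter by a threshold, and return the best matches first.

```python
def search(query, images, relevance_fn, threshold=0.5):
    """Return (image_id, score) pairs whose relevance score s = f(q, j)
    exceeds the threshold, best matches first."""
    scored = [(img, relevance_fn(query, img)) for img in images]
    matches = [(img, s) for img, s in scored if s > threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)

# Toy relevance function: 1.0 if the query word appears in the image's
# (hypothetical) tag set, else 0.0. Dropbox's real f is learned, not a lookup.
tags = {"img1.jpg": {"picnic", "park"}, "img2.jpg": {"office"}}
f = lambda q, j: 1.0 if q in tags[j] else 0.0

print(search("picnic", ["img1.jpg", "img2.jpg"], f))  # [('img1.jpg', 1.0)]
```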
Image classification & word vectors
In image classification, the classifier reads an image and outputs a scored list of categories describing its contents, where a higher score indicates a higher probability that the image belongs to that category.
For instance, categories can describe:
- Specific objects in the image (a tree, a person, etc.)
- Overall scene descriptions (outdoors, wedding, seminar, etc.)
- Characteristics of the image (black-and-white, dark background, blue sky, etc.)
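A classifier's output for one image can be pictured as a mapping from category names to confidence scores. The scores and category names below are invented for illustration; real classifiers emit thousands of such scores per image.

```python
# Hypothetical classifier output for one image: category -> confidence score.
scores = {
    "tree": 0.92, "person": 0.85, "outdoor": 0.78,
    "blue sky": 0.64, "wedding": 0.03, "black-and-white": 0.01,
}

def top_categories(scores, k=3):
    """Highest-scoring categories: the classifier's best guesses."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(top_categories(scores))  # ['tree', 'person', 'outdoor']
```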
Image classification in machine learning has improved enormously. With the latest self-supervised models, large datasets like Open Images and ImageNet, and easy-to-use libraries and frameworks such as Google's TensorFlow and Facebook's PyTorch, researchers have built image classifiers that can accurately recognise thousands of categories.
Example of image classifier results for a typical unstaged image (Source: Dropbox)
While image classifiers let the system understand what's in an image, this isn't enough to enable search. To bridge the gap, Dropbox uses 'word vectors', vectors of numbers representing the meaning of a word. The classifier's output for an image j can itself be viewed as a category-space vector jc, where 'c' is the number of categories (several thousand), and word vectors let a text query be mapped into that same space.
Citing Mikolov et al.'s 2013 word2vec paper, Dropbox notes that word2vec assigns a vector to each word in the dictionary such that words with similar meanings have vectors close to each other. Dropbox appears to have drawn on this research for its image search machine learning architecture.
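The "vectors close to each other" idea is usually measured with cosine similarity. Below is a minimal sketch using toy 3-dimensional vectors; real word2vec or Numberbatch vectors have hundreds of dimensions, and the values here are made up purely to show that related words score higher than unrelated ones.

```python
import math

def cosine(u, v):
    """Cosine similarity: close to 1.0 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-d vectors (illustrative only; real embeddings are learned).
vec = {
    "dog":   [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "car":   [0.0, 0.1, 0.9],
}

# Related words end up with a much higher similarity than unrelated ones.
assert cosine(vec["dog"], vec["puppy"]) > cosine(vec["dog"], vec["car"])
```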
Dropbox uses an EfficientNet image classifier trained on the OpenImages dataset, which produces scores for roughly 8,500 categories. "We have found that this architecture and dataset give good accuracy at a reasonable cost," claimed Dropbox.
Dropbox also uses TensorFlow to train and run the model, along with the pre-trained ConceptNet Numberbatch word vectors. "These give good results, and important to us they support multiple languages, returning similar vectors for words in different languages with similar meanings," said Dropbox, noting that this makes it easy to support image content search in multiple languages.
For instance, the word vectors for 'dog' in English and 'chien' in French are similar, so Dropbox claims it can support search in both languages without performing a direct translation. For multi-word queries, the algorithm also produces an alternate parse and runs the OR of the two parsed queries. For example, the query 'beach ball' becomes (beach AND ball) OR (beach ball), and the results of both parses are shown.
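The multi-word rewrite described above is simple string manipulation, and can be sketched as follows. This is an illustrative reconstruction of the rewrite rule, not Dropbox's actual query parser.

```python
def parse_query(query):
    """Expand a multi-word query into (w1 AND w2 ...) OR (whole phrase),
    mirroring the rewrite described in the article."""
    words = query.split()
    if len(words) == 1:
        return query  # single-word queries need no rewrite
    return "({}) OR ({})".format(" AND ".join(words), query)

print(parse_query("beach ball"))  # (beach AND ball) OR (beach ball)
print(parse_query("picnic"))      # picnic
```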
Rather than evaluating the relevance function for every image 'j' at query time, which would mean billions of entries that must be updated whenever a user adds, deletes or modifies an image, Dropbox applied the model on top of its existing Nautilus search engine.
Nautilus consists of a forward index that maps each file to its metadata (such as the filename) and the full text of the file. The index contents for a text-based search look something like this:
Search index contents for text-based search (Source: Dropbox)
With the image content search architecture, Dropbox can use the same system to implement image search. In the forward index, each image's category-space vector jc is stored, while the inverted index stores, for each category, a posting list of the images with a positive score for that category. It looks like this:
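The two index structures can be sketched with plain dictionaries. The image names, categories and scores below are toy data for illustration, not Dropbox's on-disk format; the point is the relationship between the two indexes: the forward index goes image to scores, and the inverted index is derived from it, going category to images.

```python
from collections import defaultdict

# Forward index: image id -> sparse category-score vector jc.
forward = {
    "img1.jpg": {"picnic": 0.9, "park": 0.7},
    "img2.jpg": {"office": 0.8, "desk": 0.6},
}

# Inverted index: category -> posting list of images with a positive score.
inverted = defaultdict(list)
for image, categories in forward.items():
    for category in categories:
        inverted[category].append(image)

print(inverted["picnic"])  # ['img1.jpg']
```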
Search index contents for image search (Source: Dropbox)
Is this scalable?
Dropbox notes that this approach, applied naively, is still expensive in terms of storage space and query-time processing. "If we have 10,000 categories, then for each image we have to store 10,000 classifier scores in the forward index, at the cost of 40 kilobytes if we use four-byte floating-point values," explained Dropbox. Because the classifier scores are rarely exactly zero, each image would also be added to most of those 10,000 posting lists.
In other words, for many images, the index storage would be larger than the image file.
However, Dropbox said the many near-zero values can be dropped to get a much more efficient approximation in the case of image search, and that the resulting storage and processing savings are substantial.
Here's how the approximate image search index compares with the dense approach:
- Instead of 10,000-dimensional dense vectors, the system stores sparse vectors with about 50 nonzero entries in the forward index. (A sparse vector is one in which most of the elements are zero.) About 50 two-byte integer positions plus 50 four-byte float values require roughly 300 bytes.
- In the inverted index, each image is added to 50 posting lists instead of 10,000, at a cost of about 200 bytes. The total index storage per image is therefore about 500 bytes, instead of 80 kilobytes.
- At query time, the query's category vector is truncated to about ten nonzero entries, so only ten posting lists need to be scanned, roughly the same amount of work as a text query. This also yields a smaller result set, which can be scored more quickly.
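The storage arithmetic in the comparison above checks out, and the truncation step is easy to sketch. The `sparsify` helper below is a hypothetical name for the keep-top-k operation; the byte counts simply restate the figures from the bullets.

```python
def sparsify(scores, k=50):
    """Keep only the top-k classifier scores, dropping near-zero entries."""
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)

# Storage arithmetic from the comparison above (per image).
dense_forward = 10_000 * 4     # 40,000 bytes (~40 KB) of dense float scores
sparse_forward = 50 * (2 + 4)  # 50 two-byte positions + 50 four-byte floats = 300 B
sparse_postings = 50 * 4       # ~200 bytes spread across 50 posting lists

assert sparse_forward + sparse_postings == 500  # ~500 bytes total per image
print(sparsify({"tree": 0.9, "noise": 0.01, "person": 0.5}, k=2))
```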
Both indexing and storage costs are reasonable, and query latencies are on par with those of text search. As a result, text and image searches can run in parallel, and the search engine returns the combined set of results as fast as a text-only search.
Amit Raja Naik is a senior writer at Analytics India Magazine, where he dives deep into the latest technology innovations. He is also a professional bass player.