There is a new approach to tackling the problem of training computers for ‘extreme classification problems’ such as answering general questions — the Merged-Average Classifiers via Hashing (MACH) approach. This divide-and-conquer approach to machine learning can cut the time and computational resources required to deal with extreme classification problems.
The researchers showcased their MACH approach on a large dataset — queries on Amazon by online shoppers, which included around 70 million queries for more than 49 million products. The method required a fraction of the training resources of some state-of-the-art commercial systems.
The researchers Anshumali Shrivastava and lead author Tharun Medini, both from Rice University, presented their paper at NeurIPS 2019 in Vancouver.
Using Machine Learning for Better Search
Taking online shopping as an example, one of the biggest challenges in online product search is the sheer number of products. Millions of people are shopping, using millions of distinct words to search among more than 100 million products online. These online shoppers — whether browsing of their own accord or led to the platform through a ‘Dark Pattern‘ — enter queries in their own way. Some use questions, some use keywords, and some of them aren’t sure what they are looking for. Maybe it’s safe to say many of us fall in the ‘unsure’ category, searching these online shopping websites and providing a massive amount of data for companies like Amazon, Google and Microsoft. This data on successful and unsuccessful searches can be used for deep learning, which is an effective way to train models to give better results to users.
How Extreme Classification Problems are a Hassle
Neural network models are a vast collection of mathematical equations. These deep learning systems take in a set of numbers called input vectors and transform them into a different set of numbers, the output vectors. The networks are composed of many parameters, and a state-of-the-art distributed deep learning system contains billions of them. These parameters are divided into layers. During training, each piece of data is fed to the first layer, the vectors are transformed, the outputs are fed to the next layer — and this process goes on.
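The layer-by-layer picture above can be sketched in a few lines of NumPy. The dimensions and activation below are hypothetical choices for illustration, not the researchers' actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration: a small network mapping
# 8-dimensional input vectors to 4-dimensional output vectors.
layer_sizes = [8, 16, 16, 4]

# Each layer is a weight matrix plus a bias vector — these are the "parameters".
weights = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    """Feed an input vector through the layers, one transformation at a time."""
    for W, b in zip(weights, biases):
        x = np.maximum(x @ W + b, 0.0)  # affine transform followed by ReLU
    return x

output = forward(rng.normal(size=8))
print(output.shape)  # (4,)
```

In a real extreme-classification model the final layer alone would have one output per class — millions of outputs — which is exactly where the parameter counts explode.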
These ‘extreme classification problems’ have many possible outcomes, and the deep learning models for them are so large that they might require supercomputers and a linked set of GPUs. The parameters are split up and run in parallel, and even with supercomputers and GPUs, this processing can take several days.
One of the researchers, Tharun Medini, says, “I will need 1.5 terabytes of working memory just to store the model. I haven’t even gotten to the training data. The best GPUs out there have only 32 gigabytes of memory, so training such a model is prohibitive due to massive inter-GPU communication.”
MACH takes a different approach to extreme classification problems. The lead researcher Anshumali Shrivastava describes it with a thought experiment in which he distributed the 100 million products into three classes, and these classes take the form of buckets. Shrivastava calls it a ‘drastic reduction from 100 million to three’ — yep, we agree!
“I’m mixing, let’s say, iPhones with chargers and T-shirts all in the same bucket,” Shrivastava says. Here, the 100 million products are sorted randomly into three buckets, and this random sorting is done independently in two different ‘worlds’, which means a product can wind up in a different bucket in each world. A classifier is then trained to map a search to a bucket rather than to a specific product — each classifier only needs to map a search to one of the three classes of products.
Say you feed a search to the classifier in world one, and it maps it to a bucket, say bucket number 2. The same search fed to the classifier in world two maps to bucket number 1. The most probable product is then one that lies in the intersection of these two buckets — bucket number 2 in world one and bucket number 1 in world two. Keep in mind that the roughly 100 million products are sorted randomly into three buckets in each of the two worlds, six buckets in all. Considering the possible intersections of buckets, there are three in world one times three in world two (3 × 3), that is, nine possibilities. As Shrivastava puts it, “So I have reduced my search space to one over nine, and I have only paid the cost of creating six classes.”
This concept can be extended by adding a third world and three more buckets, which increases the number of possible intersections by a factor of three, to 27. This way, one pays for only a handful of additional classes in exchange for a massive improvement.
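The buckets-and-worlds idea can be sketched in plain Python. The product count, the random assignment, and the predicted bucket numbers below are toy stand-ins for illustration — the real system uses hash functions over tens of millions of products:

```python
import random

random.seed(42)

NUM_PRODUCTS = 1000   # toy stand-in for the 100 million products
NUM_BUCKETS = 3       # buckets per world
NUM_WORLDS = 2        # independent random sortings ("worlds")

# Randomly assign every product to a bucket, independently in each world.
assignment = [
    {p: random.randrange(NUM_BUCKETS) for p in range(NUM_PRODUCTS)}
    for _ in range(NUM_WORLDS)
]

def candidates(bucket_per_world):
    """Products lying in the intersection of the predicted buckets."""
    return {
        p for p in range(NUM_PRODUCTS)
        if all(assignment[w][p] == bucket_per_world[w] for w in range(NUM_WORLDS))
    }

# Suppose the classifiers predicted bucket 2 in world one and bucket 1 in world two:
hits = candidates([2, 1])
print(len(hits))  # roughly NUM_PRODUCTS / 9
```

With two worlds of three buckets each, any pair of predictions narrows the candidates to about one ninth of the catalog, while only six classifiers' worth of classes (3 + 3) had to be trained — the trade-off the quote above describes.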
Now, in the experiment carried out by lead researcher Anshumali Shrivastava and lead author Tharun Medini on Amazon’s training data, they randomly divided the 49 million products into 10,000 buckets and repeated the process 32 times. The staggering result: a model that would otherwise require 100 billion parameters was reduced to 6.4 billion, trained in less time and with less memory. These training times and memory footprints were better than some of the best-reported training times of models with comparable parameters, including Google’s Sparsely-Gated Mixture-of-Experts (MoE) model, says Medini.
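The scale of the reduction follows from simple arithmetic: MACH replaces one 49-million-way classifier with 32 independent 10,000-way classifiers, so the number of output units shrinks from K to B × R. This toy calculation illustrates the scaling only — it does not reproduce the paper's exact parameter counts, which include other layers of the model:

```python
K = 49_000_000   # products (output classes) in the Amazon dataset
B = 10_000       # buckets per repetition
R = 32           # repetitions ("worlds")

# One K-way output layer vs. R independent B-way output layers:
# for a fixed hidden size, output-layer parameters scale with these counts.
print(K, B * R, K // (B * R))  # 49000000 320000 153
```

So the output layers collectively need roughly 150× fewer units, while the 32 repetitions still distinguish up to B^R bucket combinations — vastly more than the number of products.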
The biggest advantage MACH holds is that it requires no communication between the parallel processors. Medini says, “In principle, you could train each of the 32 on one GPU, which is something you could never do with a nonindependent approach.” This elimination of communication is a breakthrough for distributed deep learning.