During the training of a neural network, the network weights are adjusted so that the network learns the representations in the training data. When the training data is very large, this adjustment becomes computationally expensive. Word2Vec is a neural network-based model widely used in natural language processing applications, and it runs into exactly this problem as the training data grows. To address it, word2vec models use an approach called negative sampling, which allows only a small percentage of the network weights to be modified for each training sample. In this post, we will discuss the negative sampling technique used by word2vec models in detail. The major points to be covered in this article are listed below.
Table of Contents
- The Word2Vec Model
- Problem with Word2Vec Models
- The Objective of Negative Sampling
- How does Negative Sampling Work?
- How to Select Negative Samples
First of all, we will quickly have a look at the Word2vec model and understand the need for negative sampling.
The Word2Vec Model
Word2Vec is a technique for learning good-quality word embeddings from a corpus. CBOW and Skip-gram are the two well-known word2vec models, and they share a few features: the training examples are made up of pairs of words chosen for their proximity in the text, and the network’s final layer uses a softmax function. These models can be illustrated with the following representation.
The principle behind word2vec models is that words in comparable contexts (near each other) should have similar word vectors. As a result, when training the model, we should incorporate some notion of similarity. Since the dot product of similar vectors is larger, the dot product is used to measure this similarity.
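As a quick sketch of this idea, the toy vectors below (hypothetical 3-component vectors, not trained embeddings) show how the dot product, and its length-normalized form, cosine similarity, reward vectors that point in the same direction:

```python
import math

def dot(u, v):
    """Dot product of two word vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: the dot product normalized by the vector lengths."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm

# Toy 3-dimensional vectors (real word2vec vectors have hundreds of components).
cat = [0.9, 0.1, 0.3]
dog = [0.8, 0.2, 0.4]   # appears in similar contexts -> similar vector
car = [0.1, 0.9, 0.1]   # appears in different contexts -> dissimilar vector

print(cosine(cat, dog) > cosine(cat, car))  # True
```

Training nudges vectors of co-occurring words toward each other, so their dot products grow.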
Problem with Word2Vec Models
Consider the scenario where we have word vectors with 400 components and a vocabulary of 10,000 words. The neural network will have two weight matrices in this case: one for the hidden layer and one for the output layer. Each of these matrices contains 400 x 10,000 = 4 million weights.
Gradient descent on such a huge neural network will be slow. To make matters worse, tuning that many weights while avoiding overfitting requires a massive amount of training data. With millions of weights multiplied by billions of training samples, training this model will be a beast.
In other words, we may characterize the problem as follows. First, while training a neural network model, we update all the weights in the hidden layer in each back-propagation pass, yet for each training sample only the weights corresponding to the target word receive a substantial update. The weights of non-target words receive just a minor change or none at all, meaning we effectively perform very sparse updates in each pass.
Second, calculating the final probability using the softmax for each training sample is quite an expensive operation because it entails a summation of scores over all the items in our lexicon for normalization.
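A minimal sketch of why this is expensive: computing one full softmax probability requires exponentiating and summing a score for every word in the vocabulary (the 10,000-word vocabulary below matches the earlier illustration):

```python
import math
import random

VOCAB_SIZE = 10_000  # every softmax evaluation touches all of these scores

def full_softmax(scores, target_index):
    """Probability of the target word under a full softmax: the
    normalization term is a sum over the ENTIRE vocabulary."""
    denom = sum(math.exp(s) for s in scores)      # O(|V|) work per sample
    return math.exp(scores[target_index]) / denom

random.seed(0)
scores = [random.uniform(-1, 1) for _ in range(VOCAB_SIZE)]
p = full_softmax(scores, target_index=42)
print(0.0 < p < 1.0)  # True; but each call cost a 10,000-term sum
```

Doing that summation once per training sample, over billions of samples, is the bottleneck negative sampling is designed to remove.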
As a result, for each training sample, we undertake an expensive operation to calculate the probability for words whose weights may not be modified at all, or may be updated so insignificantly that the extra expense is not justified. To avoid these two challenges, we strive to limit the number of weights updated for each training sample rather than brute-forcing our way through all of them.
In their publication, the inventors of Word2Vec proposed solutions to these concerns with the following two innovations:
- To reduce the number of training examples, subsample frequently used terms.
- Using a technique called “Negative Sampling,” which causes each training sample to update only a small percentage of the model’s weights, they changed the optimization aim.
The Objective of Negative Sampling
The first innovation is subsampling frequently occurring words to reduce the number of training examples. The second is altering the optimization objective with a technique known as “Negative Sampling”, which causes each training sample to update only a tiny percentage of the model’s weights.
How does Negative Sampling Work?
The following example displays some of the training samples (word pairs) we would select from the sentence “The quick brown fox jumps over the lazy dog.” For the sake of illustration, I’ve used a modest window size of 2 words. The word marked in blue is the input word.
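The pair-generation step can be sketched as follows (a simplified illustration; `skipgram_pairs` is a hypothetical helper, not the reference word2vec code):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (input_word, context_word) training pairs, pairing each
    word with its neighbours up to `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # skip the input word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window=2)
print(pairs[:4])
# -> [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```

Each of these pairs becomes one training sample for the network.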
A neural network is trained by taking a training sample and slightly modifying all of the neuron weights so that it more accurately predicts that training sample. In other words, each training sample will change all of the neural network’s weights. Because of the breadth of our word lexicon, our skip-gram neural network has a massive number of weights, all of which would be slightly changed by each of our billions of training examples!
Negative sampling addresses this by modifying only a small percentage of the weights, rather than all of them, for each training sample. Here is how it works. Note that the network’s “label” or “correct output” is a one-hot vector when training it on the word pair (“fox”, “quick”). That is, the output neuron associated with “quick” should output a 1, whereas the rest of the thousands of output neurons should produce a 0.
Instead, with negative sampling, we’ll pick a small number of “negative” words (let’s say 5) at random to update the weights. (A “negative” term is one for which we want the network to output a 0 in this context.) We’ll also continue to update the weights for our “positive” term (in this case, the word “quick”). As a result, just the weights relating to them will be updated, and the loss will only be propagated back for them.
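A minimal sketch of one such update, assuming a logistic (sigmoid) loss per sampled word as in the word2vec paper; `train_pair` and its parameters are illustrative names, not the reference implementation:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_pair(in_vec, out_vecs, pos_idx, neg_idxs, lr=0.025):
    """One negative-sampling update: only the output vectors of the
    positive word and the k sampled negative words are touched; every
    other row of the output matrix is left alone."""
    grad_in = [0.0] * len(in_vec)
    # label 1 for the positive word, 0 for each sampled negative word
    for idx, label in [(pos_idx, 1.0)] + [(i, 0.0) for i in neg_idxs]:
        out = out_vecs[idx]
        score = sigmoid(sum(a * b for a, b in zip(in_vec, out)))
        g = lr * (label - score)              # logistic-loss gradient
        for d in range(len(in_vec)):
            grad_in[d] += g * out[d]
            out[d] += g * in_vec[d]           # update this row only
    for d in range(len(in_vec)):
        in_vec[d] += grad_in[d]               # update the input word's vector

random.seed(1)
dim, vocab = 8, 100
in_vec = [random.uniform(-0.5, 0.5) for _ in range(dim)]
out_vecs = [[0.0] * dim for _ in range(vocab)]
train_pair(in_vec, out_vecs, pos_idx=3, neg_idxs=[17, 42, 55, 71, 90])
touched = sum(1 for row in out_vecs if any(row))
print(touched)  # 6 rows updated out of 100 (1 positive + 5 negatives)
```

With a 10,000-word vocabulary, updating 6 output rows instead of all 10,000 is what makes each step cheap.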
How to Select Negative Samples
The negative samples (the 5 output words that will be trained to output 0) are chosen using a unigram distribution, with more frequent words being more likely to be chosen as negative samples.
Assume you have your whole training corpus as a list of words and choose your 5 negative samples at random from that list. In this scenario, the probability of selecting “monitor” is equal to the number of times “monitor” appears in the corpus divided by the total number of words in the corpus. According to the authors’ study, they attempted a number of variations on this equation, and the one that worked best was to raise the word counts to the 3/4 power.
To have a better understanding of this sampling approach here I am elaborating it with a technique called adjusted sampling.
Adjusted Sampling
The basic sampling strategy selects data points at random from the corpus; its disadvantage is that high-frequency data points dominate the selection.
For example, say we have a backpack with three pens, ten notebooks, and one pencil. If we choose one item from the bag, the chances of getting a notebook are 0.71 (10/14), a pen 0.21 (3/14), and a pencil 0.07 (1/14).
We don’t want to pick high-frequency words all the time because they are less informative than rare ones. In the word2vec implementation, the probability is raised to the power of 3/4. For instance, the notebook’s value of 0.71 becomes 0.77, and the pencil’s value of 0.07 becomes 0.14 (before these values are renormalized to sum to 1). The likelihood of picking the pencil is thus roughly doubled.
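The adjusted distribution can be sketched as below, using the backpack counts; here the powered counts are renormalized so the final probabilities sum to 1, which is why the resulting numbers differ slightly from the raw powered values:

```python
from collections import Counter

def negative_sampling_dist(counts, power=0.75):
    """Unigram distribution raised to the 3/4 power, as in the word2vec
    paper, then renormalized so the probabilities sum to 1."""
    weighted = {w: c ** power for w, c in counts.items()}
    total = sum(weighted.values())
    return {w: v / total for w, v in weighted.items()}

counts = Counter({"notebook": 10, "pen": 3, "pencil": 1})
dist = negative_sampling_dist(counts)
for word, p in dist.items():
    # raw unigram probability vs. the 3/4-power adjusted probability
    print(f"{word}: raw={counts[word] / 14:.2f}  adjusted={p:.2f}")
```

Frequent words still dominate, but rare words get a noticeably larger share of the negative samples.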
Conclusion
In this post, we learned how skip-gram models can be used for training word vectors and how negative sampling is utilized for this purpose. First, we understood the problem faced by word2vec models, and then we discussed the negative sampling technique that addresses it. In simple terms, it is the process of approximating the softmax function by drawing just a few examples from the set of words that do not appear in the context of the main word. This lowers the computing cost of the softmax function, which would otherwise be computed across the full vocabulary.