
How to Use Negative Sampling With Word2Vec Model?

Word2Vec is a technique to get good-quality word embeddings from a corpus. CBOW and Skip-gram are the two well-known word2vec models.

During the training of a neural network, the network weights are adjusted in order to learn the representations in the training data. If the training data is very large, this causes many issues in terms of computational cost. Word2Vec is a neural network-based model that is popularly used in natural language processing applications, and when the size of the training data grows excessively, word2vec models face exactly these issues. To address this, word2vec models can use an approach called negative sampling, which allows only a small percentage of the network weights to be modified for each training sample. In this post, we will discuss the negative sampling technique used by word2vec models in detail. The major points to be covered in this article are listed below.

Table of Contents 

  1. The Word2Vec Model
  2. Problem with Word2Vec Models
  3. The Objective of Negative Sampling
  4. How does Negative Sampling Work?
  5. How to Select Negative Samples

First of all, we will quickly have a look at the Word2vec model and understand the need for negative sampling.

The Word2Vec Model  

Word2Vec is a technique to get good-quality word embeddings from a corpus. CBOW and Skip-gram are the two well-known word2vec models, and they have a few features in common. The training examples are made up of pairs of words chosen for their proximity of occurrence, and the network’s final layer uses a softmax function.

The principle behind word2vec models is that words appearing in comparable contexts (near each other) should have similar word vectors. As a result, when training the model, we should incorporate some notion of similarity. This is done with the dot product: when vectors are similar, their dot product is larger.
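To make this concrete, here is a minimal sketch of the dot product acting as a similarity score. The vectors below are made up for demonstration, not taken from a trained model:

```python
import numpy as np

# Hypothetical 4-dimensional word vectors (real word2vec vectors are learned
# from data and typically have hundreds of dimensions).
v_cat = np.array([0.9, 0.1, 0.3, 0.7])
v_dog = np.array([0.8, 0.2, 0.4, 0.6])   # occurs in contexts similar to "cat"
v_car = np.array([-0.5, 0.9, -0.2, 0.1])  # occurs in different contexts

# A larger dot product indicates more similar vectors.
print(np.dot(v_cat, v_dog))  # relatively large
print(np.dot(v_cat, v_car))  # smaller
```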

Problem with Word2Vec Models

Consider the scenario where we have word vectors with 400 components and a vocabulary of 10,000 words. The neural network will have two weight matrices in this case, one for the hidden layer and one for the output layer. Each of these weight matrices contains 400 x 10,000 = 4 million weights.
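A quick sketch of those sizes, using the same illustrative figures of 400 components and 10,000 words:

```python
import numpy as np

vocab_size = 10_000
embedding_dim = 400

# Input-to-hidden weights (the word embeddings) and hidden-to-output weights.
W_in = np.zeros((vocab_size, embedding_dim))
W_out = np.zeros((embedding_dim, vocab_size))

print(W_in.size)   # 4,000,000 weights
print(W_out.size)  # another 4,000,000 weights
```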

Gradient descent on such a huge neural network will work slowly. To make matters worse, tuning that many weights while avoiding overfitting requires a massive amount of training data. Training this model will be a beast with millions of weights multiplied by billions of training samples.

In other words, we may characterize the problem as follows. First, while training a neural network model, we try to update all the weights in the hidden layer in each back-propagation pass, yet for a given training sample only the weights corresponding to the target word can receive a substantial update. Non-target word weights receive just a minor change or none at all, implying that we only perform very sparse modifications in each pass.

Second, calculating the final probability using the softmax for each training sample is quite an expensive operation because it entails a summation of scores over all the items in our lexicon for normalization.
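The following rough sketch (plain NumPy, not the actual word2vec implementation) shows where this cost comes from: the softmax normalization is a sum over every word in the vocabulary, recomputed for each training sample.

```python
import numpy as np

vocab_size = 10_000
embedding_dim = 400

rng = np.random.default_rng(0)
hidden = rng.normal(size=embedding_dim)               # hidden activation for one input word
W_out = rng.normal(size=(embedding_dim, vocab_size))  # output weight matrix

# Full softmax: the scores and the normalization sum touch all 10,000 output words.
scores = hidden @ W_out                  # 10,000 dot products
probs = np.exp(scores - scores.max())
probs /= probs.sum()                     # sum over the whole vocabulary
```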

As a result, for each training sample we undertake an expensive operation to calculate the likelihood of words whose weights may not be modified at all, or may be updated so insignificantly that the extra expense is not justified. To avoid these two challenges, we try to limit the number of weights updated for each training sample rather than brute-forcing our way through all of the training samples.

In their publication, the inventors of Word2Vec proposed the solution to these concerns with the following two innovations:

  • Subsampling frequently used terms to reduce the number of training examples.
  • Changing the optimization objective with a technique called “Negative Sampling,” which causes each training sample to update only a small percentage of the model’s weights.

The Objective of Negative Sampling 

Subsampling frequently occurring words reduces the number of training examples, while the altered optimization objective, achieved with a technique known as “Negative Sampling”, causes each training sample to update only a tiny percentage of the model’s weights. The objective of negative sampling is therefore to keep training tractable without touching every weight for every sample.

How does Negative Sampling Work?

Consider some of the training samples (word pairs) we would select from the sentence “The quick brown fox jumps over the lazy dog.” For the sake of illustration, a modest window size of 2 words is used, and one word of each pair serves as the input word.
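Here is a minimal sketch of how such (input, context) pairs can be generated with a window size of 2; this is illustrative code, not the exact pipeline of the original implementation:

```python
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    # Pair the center word with every word up to `window` positions away.
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

# Pairs with "fox" as the input word:
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
print([p for p in pairs if p[0] == "fox"])
```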

A neural network is trained by taking a training sample and slightly modifying all of the neuron weights so that it more accurately predicts that training sample. In other words, each training sample will change all of the neural network’s weights. Because of the breadth of our word lexicon, our skip-gram neural network has a massive number of weights, all of which would be slightly changed by each of our billions of training examples!

Negative sampling addresses this by modifying only a small percentage of the weights, rather than all of them, in each training sample. This is how it goes. Note that the network’s “label” or “correct output” is a one-hot vector when training it on the word pair (“fox”, “quick”). That is, the output neuron associated with “quick” should output a 1, whereas the rest of the hundreds of output neurons should produce a 0.

Instead, with negative sampling, we’ll pick a small number of “negative” words (let’s say 5) at random to update the weights. (A “negative” term is one for which we want the network to output a 0 in this context.) We’ll also continue to update the weights for our “positive” term (in this case, the word “quick”). As a result, just the weights relating to them will be updated, and the loss will only be propagated back for them.
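Below is a hedged sketch of the negative-sampling objective for a single (input, output) pair such as (“fox”, “quick”) with 5 negative words. The vocabulary size, vector dimension, word indices and the uniform draw of negatives are placeholders for illustration; word2vec actually draws negatives from the adjusted unigram distribution described in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 10_000, 400, 5

W_in = rng.normal(scale=0.01, size=(vocab_size, dim))   # input (embedding) vectors
W_out = rng.normal(scale=0.01, size=(vocab_size, dim))  # output vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

center, positive = 42, 137                       # placeholder indices for "fox" and "quick"
negatives = rng.integers(0, vocab_size, size=k)  # 5 negative words (uniform here for brevity)

v = W_in[center]
# The positive pair should be scored 1, the negative pairs 0.
loss = -np.log(sigmoid(W_out[positive] @ v))
loss += -np.log(sigmoid(-W_out[negatives] @ v)).sum()

# Only the rows for `positive`, `negatives` and `center` would receive gradients,
# i.e. (k + 1) output vectors plus one input vector instead of the whole matrix.
print(loss)
```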

How to Select Negative Samples 

The negative samples (the 5 output words that will be trained to output 0) are chosen using a unigram distribution, with more frequent words being more likely to be chosen as negative samples.

Assume you have your whole training corpus as a list of words and choose your 5 negative samples at random from that list. In this scenario, the probability of selecting “monitor” is equal to the number of times “monitor” appears in the corpus divided by the total number of words in the corpus. The authors experimented with a number of variations on this equation, and the one that worked best was to raise the word counts to the 3/4 power before normalizing.
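A small sketch of that adjusted distribution, using hypothetical word counts rather than a real corpus:

```python
import numpy as np

# Hypothetical corpus frequencies; in practice these come from the training corpus.
counts = {"the": 50_000, "monitor": 120, "fox": 30, "quick": 400}

words = list(counts)
freqs = np.array([counts[w] for w in words], dtype=float)

# P(w) = count(w)^0.75 / sum over the vocabulary of count^0.75
probs = freqs ** 0.75
probs /= probs.sum()

rng = np.random.default_rng(0)
negative_samples = rng.choice(words, size=5, p=probs)
print(dict(zip(words, probs.round(3))), negative_samples)
```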

To better understand this sampling approach, let us elaborate on it with a technique called adjusted sampling.

Adjusted Sampling 

The fundamental sampling strategy involves selecting data points at random; its disadvantage is that high-frequency data points dominate the selection, since items are drawn in proportion to the distribution.

For example, suppose we have a backpack with three pens, ten notebooks, and one pencil. If we choose one item from the bag, the chance of getting a notebook is 0.71 (10/14), of a pen 0.21 (3/14), and of a pencil 0.07 (1/14).

We don’t want to pick high-frequency words all the time, because they are less informative than rare ones. In the word2vec implementation, the counts are raised to the power of 3/4 and the probabilities are then renormalized. For instance, the probability of a notebook drops from 0.71 to about 0.63, while the probability of a pencil rises from 0.07 to about 0.11, so rare items become noticeably more likely to be drawn.
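The same calculation for the backpack example, as a quick check of the figures above:

```python
counts = {"notebook": 10, "pen": 3, "pencil": 1}

total = sum(counts.values())
plain = {w: c / total for w, c in counts.items()}     # ~0.71, ~0.21, ~0.07

powered = {w: c ** 0.75 for w, c in counts.items()}
z = sum(powered.values())
adjusted = {w: p / z for w, p in powered.items()}     # ~0.63, ~0.26, ~0.11

print(plain)
print(adjusted)
```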

Conclusion

In this post, we learned how skip-gram models can be used for training word vectors and how negative sampling is utilized for this purpose. First, we understood the problem faced by word2vec models, and then we discussed the negative sampling technique used to address it. In simple terms, negative sampling approximates the softmax function by drawing just a few examples from the set of words that do not appear in the context of the input word, which lowers the computational cost of the softmax that would otherwise be computed across the full vocabulary.
