Whenever you start typing on your mobile phone, writing an email or searching for something on Google, you have probably noticed that the next word is suggested automatically after you type the first few. Text processing is at the heart of these and many similar applications, and all of them deal with enormous amounts of text. So how do we make today’s machines perform clustering or classification on text data, which they cannot handle in its raw form? We create a representation of words that captures their meanings, semantic relationships, and the different contexts they appear in. This is implemented using word embeddings, numerical representations of text that a computer can work with.
Word Embeddings
In natural language processing, word embedding is a term used to represent words for text analysis, typically as real-valued vectors that encode a word’s meaning. Words that are close to each other in the vector space are expected to have similar meanings. Word embedding uses language modelling and feature learning techniques in which words from the vocabulary are mapped to vectors of real numbers.
Let’s take an example: text = “The match between India and New-Zealand delayed due to rain”
From the above text, we can form the dictionary of unique words as follows.

['The', 'match', 'between', 'India', 'and', 'New-Zealand', 'delayed', 'due', 'to', 'rain'] and the vector representation of each word can be a one-hot encoded vector. The vector representation of the word 'India' according to the above vocabulary is [0,0,0,1,0,0,0,0,0,0].
If we try to visualize these vectors, we can think of a 10-dimensional space where each word occupies one dimension and has no relation to any other vector; in a one-hot encoded representation, all the words are independent of each other. This is where the idea of distributed representations comes in: introducing some dependence of one word on another.
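As a quick illustration, here is a minimal one-hot encoding sketch in plain Python for the example sentence; the variable names are ours and not part of any library.

# one-hot encoding sketch for the example sentence (illustrative only)
vocab = ['The', 'match', 'between', 'India', 'and', 'New-Zealand', 'delayed', 'due', 'to', 'rain']
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # a 10-dimensional vector with a single 1 at the word's position
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot('India'))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]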
There are different word embedding techniques, such as Count Vectorizer, TF-IDF Vectorizer, Continuous Bag of Words (CBOW) and Skip-Gram. Details of the Count Vectorizer and TF-IDF Vectorizer can be found here, where classification tasks are carried out.
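For reference, a minimal sketch of the count-based approaches using scikit-learn (a library not used in the rest of this article) could look like this; the toy corpus is our own.

# count-based vectorization sketch using scikit-learn (illustrative only)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the match between india and new zealand delayed due to rain",
          "rain delayed the match"]

count_vec = CountVectorizer()
print(count_vec.fit_transform(corpus).toarray())   # raw term counts per document
print(count_vec.get_feature_names_out())           # the learned vocabulary

tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(corpus).toarray())   # tf-idf weighted values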
In this article, we mainly focus on the Word2Vec technique of word embedding.
Word2vec
Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text.
Word2vec is a two-layer neural network that processes text by “vectorizing” words. Its input is a text corpus, and its output is a set of feature vectors that represent the words in that corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen so that a simple mathematical function, the cosine similarity between the vectors, indicates the level of semantic similarity between the words they represent.
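As a reminder of what that similarity function computes, here is a minimal NumPy sketch; the two vectors are made-up values, not real word2vec embeddings.

# cosine similarity between two word vectors (made-up values for illustration)
import numpy as np

def cosine_sim(a, b):
    # 1 means same direction, 0 means orthogonal, -1 means opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v_king = np.array([0.5, 0.8, 0.1])
v_queen = np.array([0.45, 0.75, 0.2])
print(cosine_sim(v_king, v_queen))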
Word2vec is not a single algorithm but a combination of the two techniques mentioned above, i.e. CBOW (Continuous Bag of Words) and Skip-Gram.
A detailed explanation of CBOW with code examples can be found here; we will take a deep dive into the Skip-Gram technique.
Skip-Gram Model
The continuous skip-gram model learns by predicting the surrounding words given a current word. In other words, the Continuous Skip-Gram Model predicts words within a certain range before and after the current word in the same sentence.
While the Continuous Bag of Words model predicts a word given its neighbouring context, the skip-gram model does the opposite: it predicts the context or neighbouring words for a given word. The Skip-Gram model is trained on pairs of (target_word, context_word), each labelled 1 or 0. The label specifies whether the context_word actually comes from the same window as the target word or was sampled randomly from the vocabulary; the randomly sampled pairs, labelled 0, serve as negative examples.
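To make the (target_word, context_word) idea concrete, here is a minimal sketch, with a toy sentence and window size of our own choosing, that lists the positive pairs a skip-gram window produces.

# positive (target, context) pairs for a toy sentence, window size 2
sentence = "the match between india and new zealand".split()
window_size = 2

pairs = []
for i, target in enumerate(sentence):
    start = max(0, i - window_size)
    end = min(len(sentence), i + window_size + 1)
    for j in range(start, end):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:5])  # [('the', 'match'), ('the', 'between'), ('match', 'the'), ...]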
Code Implementation of Skip-Gram Model
Steps to be followed:
- Build the corpus vocabulary
- Build a skip-gram [(target, context), relevancy] generator
- Build the skip-gram model architecture
- Train the Model
- Get Word Embeddings
1. Build the corpus vocabulary:
The first essential step in building any NLP model is to create a corpus vocabulary: we extract each unique word from the corpus and assign a unique numeric identifier to it.
In this article, the corpus we are using is ‘The King James Version of the Bible’ from Project Gutenberg, available for free through the corpus module in nltk.
Import all dependencies:
import re  # used in text preprocessing below
import nltk
import numpy as np
from nltk.corpus import gutenberg            # to get the Bible corpus
from string import punctuation               # to remove punctuation from the corpus
from keras.preprocessing import text
from keras.preprocessing.sequence import skipgrams
from keras.layers import *
from keras.layers.core import Dense, Reshape
from keras.layers.embeddings import Embedding
from keras.models import Model, Sequential
Download the gutenberg corpus, the punkt model and the stopwords list from nltk as shown below:
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
We use a user-defined function for text preprocessing that removes extra whitespace, digits and stopwords, and lower-cases the text corpus.
bible = gutenberg.sents("bible-kjv.txt")
remove_terms = punctuation + '0123456789'
wpt = nltk.WordPunctTokenizer()

def normalize_document(doc):
    # lower case and remove special characters/whitespaces
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)
Next, we extract the unique words from the corpus and assign each a numeric identifier.
norm_bible = [[word.lower() for word in sent if word not in remove_terms] for sent in bible]
norm_bible = [' '.join(tok_sent) for tok_sent in norm_bible]
norm_bible = filter(None, normalize_corpus(norm_bible))
norm_bible = [tok_sent for tok_sent in norm_bible if len(tok_sent.split()) > 2]

tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(norm_bible)
word2id = tokenizer.word_index
id2word = {v: k for k, v in word2id.items()}

vocab_size = len(word2id) + 1
wids = [[word2id[w] for w in text.text_to_word_sequence(doc)] for doc in norm_bible]

print('Vocabulary Size:', vocab_size)
print('Vocabulary Sample:', list(word2id.items())[:5])
Output:
Vocabulary Size: 12588
Vocabulary Sample: [('shall', 1), ('unto', 2), ('lord', 3), ('thou', 4), ('thy', 5)]
2. Build a Skip-Gram [(target, context), relevancy] generator:
The skipgrams utility from keras.preprocessing.sequence turns a sequence of word indices into tuples of words of the form:
(word, word in the same window), with label 1 (positive samples)
(word, random word from the vocabulary), with label 0 (negative samples)
# generate skip-grams
skip_grams = [skipgrams(wid, vocabulary_size=vocab_size, window_size=10) for wid in wids]

# view sample skip-grams
pairs, labels = skip_grams[0][0], skip_grams[0][1]
for i in range(10):
    print("({:s} ({:d}), {:s} ({:d})) -> {:d}".format(
        id2word[pairs[i][0]], pairs[i][0],
        id2word[pairs[i][1]], pairs[i][1],
        labels[i]))
Output:
(king (13), james (1154)) -> 1
(bible (5766), willing (1559)) -> 0
(james (1154), king (13)) -> 1
(james (1154), bible (5766)) -> 1
(king (13), bible (5766)) -> 1
(bible (5766), supper (2792)) -> 0
(james (1154), moreover (378)) -> 0
(king (13), comforters (4903)) -> 0
(james (1154), nourish (4708)) -> 0
(bible (5766), james (1154)) -> 1
3. Build the Skip-Gram model architecture:
Using Keras with a TensorFlow backend, we will build the deep learning architecture of the skip-gram model. Our input is a (target word, context word) pair, which means we need to process two inputs. Each input is passed to a separate embedding layer to get word embeddings for the target and context words. Afterwards, we merge these two layers and pass the result to a dense layer that predicts either 1 or 0 depending on whether the pair of words is contextually relevant or just randomly generated.
# build skip-gram architecture
embed_size = 100

word_model = Sequential()
word_model.add(Embedding(vocab_size, embed_size,
                         embeddings_initializer="glorot_uniform",
                         input_length=1))
word_model.add(Reshape((embed_size,)))

context_model = Sequential()
context_model.add(Embedding(vocab_size, embed_size,
                            embeddings_initializer="glorot_uniform",
                            input_length=1))
context_model.add(Reshape((embed_size,)))

merged_output = add([word_model.output, context_model.output])

model_combined = Sequential()
model_combined.add(Dense(1, kernel_initializer="glorot_uniform", activation="sigmoid"))

final_model = Model([word_model.input, context_model.input], model_combined(merged_output))
final_model.compile(loss="mean_squared_error", optimizer="rmsprop")

final_model.summary()

# visualize model structure
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(final_model, show_shapes=True, show_layer_names=False,
                 rankdir='TB').create(prog='dot', format='svg'))
Summary of model:
4. Train the model:
Training the model on the complete corpus takes a long time; hence, we train it for just two epochs here. You can increase the number of epochs if needed.
for epoch in range(1, 3):
    loss = 0
    for i, elem in enumerate(skip_grams):
        pair_first_elem = np.array(list(zip(*elem[0]))[0], dtype='int32')
        pair_second_elem = np.array(list(zip(*elem[0]))[1], dtype='int32')
        labels = np.array(elem[1], dtype='int32')
        X = [pair_first_elem, pair_second_elem]
        Y = labels
        if i % 10000 == 0:
            print('Processed {} (skip_first, skip_second, relevance) pairs'.format(i))
        loss += final_model.train_on_batch(X, Y)
    print('Epoch:', epoch, 'Loss:', loss)
5. Get word embeddings:
To get word embeddings for our entire vocabulary, we can extract them from the embedding layer; here, we take the weights of the embedding layer of word_model.
from sklearn.metrics.pairwise import euclidean_distances

# take the weights of the word embedding layer, dropping the padding row (index 0)
word_embed_layer = word_model.layers[0]
weights = word_embed_layer.get_weights()[0][1:]

distance_matrix = euclidean_distances(weights)
print(distance_matrix.shape)

# for each search term, pick the five closest words (skipping the word itself);
# the -1 / +1 shifts convert between matrix rows and word ids
similar_words = {search_term: [id2word[idx]
                               for idx in distance_matrix[word2id[search_term] - 1].argsort()[1:6] + 1]
                 for search_term in ['god', 'jesus', 'egypt', 'john', 'famine']}
similar_words
Output:
(12424, 12424)

{'egypt': ['congregation', 'stood', 'lad', 'officers', 'blood'],
 'famine': ['bank', 'corrupted', 'pit', 'ill', 'burdens'],
 'god': ['clothes', 'house', 'come', 'side', 'came'],
 'jesus': ['baptism', 'otherwise', 'general', 'shortly', 'wanting'],
 'john': ['zebedee', 'disciple', 'soldiers', 'council', 'repenteth']}
We can see that the model returns reasonably related words for each target word. Accuracy can be improved by training for more epochs, though this adds computational time.
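As an optional extension, not part of the original walkthrough, we could also compare two learned vectors directly with cosine similarity, reusing the weights and word2id objects extracted above.

# cosine similarity between two learned word vectors (optional extension)
from sklearn.metrics.pairwise import cosine_similarity

def word_similarity(word1, word2):
    # indices are shifted by 1 because row 0 (padding) was dropped from `weights`
    v1 = weights[word2id[word1] - 1].reshape(1, -1)
    v2 = weights[word2id[word2] - 1].reshape(1, -1)
    return cosine_similarity(v1, v2)[0][0]

print(word_similarity('god', 'lord'))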
Wevi: Word Embedding Visual Inspector
Wevi is an interactive visual interface that demonstrates the mechanism of the CBOW and Skip-Gram models. It allows the user to examine the movement of the input and output vectors at each epoch, and it also provides facilities to train models on different batches. With the help of Principal Component Analysis (PCA), it visualizes the high-dimensional vectors in a scatter plot.
After training, users can manually activate an input and inspect and trace the signal up to the output layer, which gives clear insight into the working mechanism. The user is also free to tune hyperparameters such as the hidden layer size and the learning rate.
Visit http://bit.ly/wevi-online to learn more about the working mechanism of CBOW and skip-gram.
Conclusion:
This article taught us about word embeddings and their different types, from basic representations to advanced ones. Later, we saw a practical implementation of word2vec using the skip-gram architecture. Lastly, we looked at the user-friendly interface of wevi, which gives clear insight into how models like CBOW and skip-gram work and lets users track the neurons for the respective words.