Guide To Word2vec Using Skip Gram Model

Whenever you start typing on your mobile phone, writing an email or searching for something on Google, you might have noticed that the next word gets suggested automatically after you type a few words. Text processing is the common thread behind these and similar applications, and all of them deal with enormous amounts of text. So how do we make today's machines perform clustering and classification on text data, when they cannot handle raw text directly? We create a representation of words that captures their meanings, semantic relationships and the different contexts they are used in. All of this is implemented using word embeddings, i.e. numerical representations of text that a computer can process.

Word Embeddings

In natural language processing, word embedding is a term used to represent words for text analysis, typically as real-valued vectors that encode the word’s meaning. Words that are close together in the vector space are expected to have similar meanings. Word embedding uses language modeling and feature learning techniques in which words from the vocabulary are mapped to vectors of real numbers.

Let’s take an example: text = “The match between India and New-Zealand delayed due to rain”

From the above text, we can form a dictionary of its unique words as follows:

[‘The’, ‘match’, ‘between’, ‘India’, ‘and’, ‘New-Zealand’, ‘delayed’, ‘due’, ‘to’, ‘rain’], and the vector representation of each word can be a one-hot encoded vector. The vector representation of the word ‘India’ according to the above vocabulary is [0,0,0,1,0,0,0,0,0,0].

If you try to visualize these vectors, you can think of a 10-dimensional space in which each word occupies one dimension and has no relation to any of the other vectors; in a one-hot encoded representation, all the words are independent of each other. This is where the idea of distributed representations comes in, introducing some dependence of one word on another.
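
To make this concrete, here is a minimal sketch of one-hot encoding for the example vocabulary above; the one_hot helper and the variable names are purely illustrative and not part of the article’s code.

 # minimal one-hot encoding sketch for the example vocabulary (illustrative only)
 vocab = ['The', 'match', 'between', 'India', 'and',
          'New-Zealand', 'delayed', 'due', 'to', 'rain']
 def one_hot(word, vocab):
     # each word gets a 10-dimensional vector with a single 1 at its own index
     vec = [0] * len(vocab)
     vec[vocab.index(word)] = 1
     return vec
 print(one_hot('India', vocab))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]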

There are different word embedding techniques such as Count-Vectorizer, TFIDF-Vectorizer, Continuous Bag of Words (CBOW) and Skip-Gram. Details of Count-Vectorizer and TFIDF-Vectorizer can be found here, where classification tasks are carried out.

In this article, we mainly focus on the Word2Vec technique of word embedding.

Word2vec

Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text.         

Word2vec is a two-layer neural network that processes text by “vectorizing” words. Its input is a text corpus, and its output is a set of feature vectors that represent the words in that corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors.
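
As a quick illustration of that similarity measure, the short sketch below computes the cosine similarity between a few hypothetical 3-dimensional word vectors; the vectors and their values are made up purely for demonstration.

 import numpy as np
 def cosine_sim(a, b):
     # cosine of the angle between two word vectors; values near 1 mean similar direction
     return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
 # hypothetical 3-dimensional embeddings, for illustration only
 v_king  = np.array([0.8, 0.3, 0.1])
 v_queen = np.array([0.7, 0.4, 0.1])
 v_rain  = np.array([0.1, 0.9, 0.7])
 print(cosine_sim(v_king, v_queen))  # ~0.99, semantically close
 print(cosine_sim(v_king, v_rain))   # ~0.43, much less similar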

Word2vec is not a single algorithm but a combination of two techniques: CBOW (Continuous Bag of Words) and Skip-Gram.

A detailed explanation of CBOW with code examples can be found here; we will take a deep dive into the Skip-Gram technique. 

Skip-Gram Model

The continuous skip-gram model learns by predicting the surrounding words given a current word. In other words, the Continuous Skip-Gram Model predicts words within a certain range before and after the current word in the same sentence.

Fig 1: Skip-gram model architecture

The Continuous Bag of Words model predicts a word given its neighbouring context, whereas, as shown in the architecture above, the skip-gram model predicts the context (neighbour) words for a given word. The Skip-Gram model is trained on pairs of (target_word, context_word), each labelled 1 or 0. The label specifies whether the context word comes from the same window as the target word or was generated randomly from the vocabulary; the randomly generated pairs serve as negative samples during training.
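
To make the idea of a window concrete, here is a small pure-Python sketch (not part of the implementation below) that generates the positive (target, context) pairs for the earlier example sentence with a window size of 2; the Keras skipgrams utility used later produces such pairs and additionally draws the random negative pairs.

 # illustrative sketch: positive (target, context) pairs with a window size of 2
 sentence = "the match between india and new-zealand delayed due to rain".split()
 window = 2
 pairs = []
 for i, target in enumerate(sentence):
     # context words are the neighbours within `window` positions on either side
     for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
         if j != i:
             pairs.append((target, sentence[j]))
 print(pairs[:5])
 # [('the', 'match'), ('the', 'between'), ('match', 'the'), ('match', 'between'), ('match', 'india')]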

Code Implementation of Skip-Gram Model

Steps to be followed:

  1. Build the corpus vocabulary
  2. Build a skip-gram [(target, context), relevancy] generator
  3. Build the skip-gram model architecture
  4. Train the Model
  5. Get Word Embeddings
1. Build the corpus vocabulary:

The essential first step in building any NLP-based model is to create the corpus vocabulary: we extract each unique word from the corpus and assign a unique numeric identifier to it.

In this article, the corpus we are using is ‘The King James Version of the Bible’ from Project Gutenberg, available for free through the corpus module in nltk.

Import all dependencies:

 from nltk.corpus import gutenberg # to get bible corpus
 from string import punctuation # to remove punctuation from corpus
 import re # needed by the text normalization function below
 import nltk
 import numpy as np
 from keras.preprocessing import text
 from keras.preprocessing.sequence import skipgrams 
 from keras.layers import *
 from keras.layers.core import Dense, Reshape
 from keras.layers.embeddings import Embedding
 from keras.models import Model,Sequential 

Download the Gutenberg corpus, the punkt tokenizer model and the stopwords list from nltk as shown below:

 nltk.download('gutenberg')
 nltk.download('punkt')
 nltk.download('stopwords')
 stop_words = nltk.corpus.stopwords.words('english') 

We use a user-defined function for text preprocessing that removes special characters, digits and extra whitespace, filters out stopwords, and lower-cases the text corpus.

 bible = gutenberg.sents("bible-kjv.txt")
 remove_terms = punctuation + '0123456789'
 wpt = nltk.WordPunctTokenizer()
 def normalize_document(doc):
     # lower case and remove special characters\whitespaces
     doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
     doc = doc.lower()
     doc = doc.strip()
     # tokenize document
     tokens = wpt.tokenize(doc)
     # filter stopwords out of document
     filtered_tokens = [token for token in tokens if token not in stop_words]
     # re-create document from filtered tokens
     doc = ' '.join(filtered_tokens)
     return doc
 normalize_corpus = np.vectorize(normalize_document) 

Next, we extract each unique word from the corpus and assign it a numeric identifier.

 norm_bible = [[word.lower() for word in sent if word not in remove_terms] for sent in bible]
 norm_bible = [' '.join(tok_sent) for tok_sent in norm_bible]
 norm_bible = filter(None, normalize_corpus(norm_bible))
 norm_bible = [tok_sent for tok_sent in norm_bible if len(tok_sent.split()) > 2]
 tokenizer = text.Tokenizer()
 tokenizer.fit_on_texts(norm_bible)
 word2id = tokenizer.word_index
 id2word = {v:k for k, v in word2id.items()}
 vocab_size = len(word2id) + 1
 wids = [[word2id[w] for w in text.text_to_word_sequence(doc)] for doc in norm_bible]
 print('Vocabulary Size:', vocab_size)
 print('Vocabulary Sample:', list(word2id.items())[:5]) 

Output:

 Vocabulary Size: 12588
 Vocabulary Sample: [('shall', 1), ('unto', 2), ('lord', 3), ('thou', 4), ('thy', 5)] 
2. Build a Skip-Gram [(target, context), relevancy] generator:

Keras provides a skipgrams utility in its preprocessing module, which turns a sequence of word indices into training tuples of the form:

(word, word in the same window), with label 1 (positive samples)

(word, random word from the vocabulary), with label 0 (negative samples) 

 # generate skip-grams
 skip_grams = [skipgrams(wid, vocabulary_size=vocab_size, window_size=10) for wid in wids]
 # view sample skip-grams
 pairs, labels = skip_grams[0][0], skip_grams[0][1]
 for i in range(10):
     print("({:s} ({:d}), {:s} ({:d})) -> {:d}".format(
           id2word[pairs[i][0]], pairs[i][0], 
           id2word[pairs[i][1]], pairs[i][1], 
           labels[i])) 

Output:

 (king (13), james (1154)) -> 1
 (bible (5766), willing (1559)) -> 0
 (james (1154), king (13)) -> 1
 (james (1154), bible (5766)) -> 1
 (king (13), bible (5766)) -> 1
 (bible (5766), supper (2792)) -> 0
 (james (1154), moreover (378)) -> 0
 (king (13), comforters (4903)) -> 0
 (james (1154), nourish (4708)) -> 0
 (bible (5766), james (1154)) -> 1 
3. Build the Skip-Gram model architecture:

Using Keras with a TensorFlow backend, we will build the deep learning architecture for skip-gram. Our input is a pair of target word and context word, which means we need to process two inputs. Each input is passed to a separate embedding layer to get the word embeddings for the target and context words. Afterwards, we merge these two layers and pass the result to a dense layer that predicts either 1 or 0, depending on whether the pair of words is contextually relevant or just randomly generated.

 # build skip-gram architecture
 embed_size = 100
 word_model = Sequential()
 word_model.add(Embedding(vocab_size, embed_size,
                          embeddings_initializer="glorot_uniform",
                          input_length=1))
 word_model.add(Reshape((embed_size, )))
 context_model = Sequential()
 context_model.add(Embedding(vocab_size, embed_size,
                   embeddings_initializer="glorot_uniform",
                   input_length=1))
 context_model.add(Reshape((embed_size,)))
 merged_output = add([word_model.output, context_model.output])  
 model_combined = Sequential()
 model_combined.add(Dense(1, kernel_initializer="glorot_uniform", activation="sigmoid"))
 final_model = Model([word_model.input, context_model.input], model_combined(merged_output))
 final_model.compile(loss="mean_squared_error", optimizer="rmsprop")
 final_model.summary()
 # visualize model structure
 from IPython.display import SVG
 from keras.utils.vis_utils import model_to_dot
 SVG(model_to_dot(final_model, show_shapes=True, show_layer_names=False, 
                  rankdir='TB').create(prog='dot', format='svg')) 
Summary of model:
4. Train the model:

Training the model on the complete corpus takes a long time, so we run the training loop below for only two epochs; you can increase the number of epochs if needed.

 for epoch in range(1, 3):
     loss = 0
     for i, elem in enumerate(skip_grams):
         pair_first_elem = np.array(list(zip(*elem[0]))[0], dtype='int32')
         pair_second_elem = np.array(list(zip(*elem[0]))[1], dtype='int32')
         labels = np.array(elem[1], dtype='int32')
         X = [pair_first_elem, pair_second_elem]
         Y = labels
         if i % 10000 == 0:
             print('Processed {} (skip_first, skip_second, relevance) pairs'.format(i))
         loss += final_model.train_on_batch(X,Y)  
     print('Epoch:', epoch, 'Loss:', loss) 
5. Get word embeddings:

To get word embeddings for our entire vocabulary, we can extract them from the embedding layer. Specifically, we extract the weights of the embedding layer of word_model.

 from sklearn.metrics.pairwise import euclidean_distances
 word_embed_layer = word_model.layers[0]
 weights = word_embed_layer.get_weights()[0][1:]
 distance_matrix = euclidean_distances(weights)
 print(distance_matrix.shape)
 similar_words = {search_term: [id2word[idx] for idx in distance_matrix[word2id[search_term]-1].argsort()[1:6]+1] 
                    for search_term in ['god', 'jesus','egypt', 'john', 'famine']}
 similar_words 

Output:

 (12424, 12424)
 {'egypt': ['congregation', 'stood', 'lad', 'officers', 'blood'],
  'famine': ['bank', 'corrupted', 'pit', 'ill', 'burdens'],
  'god': ['clothes', 'house', 'come', 'side', 'came'],
  'jesus': ['baptism', 'otherwise', 'general', 'shortly', 'wanting'],
  'john': ['zebedee', 'disciple', 'soldiers', 'council', 'repenteth']} 

We can see that the model returns reasonably related words for each target word. The quality of the embeddings can be improved by training for more epochs, but note that this will add computational time.
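
Since cosine similarity is the measure usually associated with word2vec (as noted earlier), the following sketch reworks the same nearest-neighbour lookup using scikit-learn’s cosine_similarity; it reuses the weights, word2id and id2word objects from the code above, and the most_similar helper is illustrative.

 from sklearn.metrics.pairwise import cosine_similarity
 # cosine similarity between every pair of word vectors (rows of `weights`)
 similarity_matrix = cosine_similarity(weights)
 def most_similar(word, topn=5):
     # higher cosine similarity means more similar, so sort in descending order
     row = similarity_matrix[word2id[word] - 1]
     nearest = row.argsort()[::-1][1:topn + 1]
     return [id2word[idx + 1] for idx in nearest]
 print(most_similar('god'))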

Wevi: Word Embedding Visual Inspector

Wevi is an interactive visual interface that demonstrates the mechanics of the CBOW and Skip-Gram models. It allows the user to examine how the input and output vectors move after each training epoch. It also provides facilities to train the model on different batches of sentences. With the help of Principal Component Analysis (PCA), it visualizes the high-dimensional vectors in a scatter plot.

After training, the user can manually activate an input and inspect and trace the activations up to the output layer, which gives clear insight into the working mechanism. The user is also free to tune hyperparameters such as the hidden layer size and the learning rate.

Visit http://bit.ly/wevi-online to learn more about the working mechanism of CBOW and skip-gram.

Screenshot of wevi interface 

Conclusion:

This article covered word embeddings and their different types, from basic representations to more advanced ones. We then walked through a practical implementation of word2vec using the skip-gram architecture. Lastly, we looked at the user-friendly wevi interface, which gives clear insight into how models like CBOW and skip-gram work and lets users track the neurons for individual words.
