
# Guide To Word2vec Using Skip Gram Model



Whenever you type on your phone, write an email, or search on Google, the next word is often suggested automatically after you type a few words. All of these scenarios involve processing large amounts of text. So how do we get machines to cluster or classify text when they cannot work with raw words directly? The answer is to build a representation of words that captures their meanings, semantic relationships, and the contexts in which they appear. This is done with word embeddings: numerical representations of text that a computer can handle.

## Word Embeddings

In natural language processing, a word embedding is a representation of a word for text analysis, typically a real-valued vector that encodes the word's meaning. Words that are close in vector space are expected to have similar meanings. Word embeddings are produced with language modeling and feature learning techniques that map words from the vocabulary to vectors of real numbers.

Let’s take an example, text = “The match between India and New-Zealand delayed due to rain”


From the above text, we can form the following dictionary of unique words:

['The', 'match', 'between', 'India', 'and', 'New-Zealand', 'delayed', 'due', 'to', 'rain']

Each word can then be represented as a one-hot encoded vector. For this vocabulary, the vector for the word 'India' is [0,0,0,1,0,0,0,0,0,0].

If we visualize these vectors, we can think of a 10-dimensional space in which each word occupies its own dimension and has no relation to the other words. In a one-hot representation, all words are independent of each other. This is where the idea of a distributed representation comes in: it introduces some dependence of one word on another.
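As a minimal sketch of the one-hot scheme described above, the following NumPy snippet builds the vectors for the example vocabulary. The helper name `one_hot` and the `word_to_index` mapping are illustrative choices, not part of any library.

```python
import numpy as np

# Vocabulary from the example sentence, in order of first appearance.
vocab = ["The", "match", "between", "India", "and",
         "New-Zealand", "delayed", "due", "to", "rain"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for `word` over `vocab` (illustrative helper)."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

india = one_hot("India")
rain = one_hot("rain")
print(india.tolist())     # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(int(india @ rain))  # 0
```

The dot product of any two different one-hot vectors is zero, which is another way of saying that this representation carries no information about how words relate to each other.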

There are different techniques for turning words into vectors, such as Count-Vectorizer, TFIDF-Vectorizer, Continuous Bag of Words (CBOW), and Skip-Gram. Count-Vectorizer and TFIDF-Vectorizer are covered in a separate article in the context of classification tasks.

In this article, we focus mainly on the Word2Vec technique of word embedding.

## Word2vec

Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text.

Word2vec is a two-layer neural network that processes text by “vectorizing” words. Its input is a text corpus, and its output is a set of feature vectors that represent the words in that corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen so that a simple mathematical function, the cosine similarity between the vectors, indicates the level of semantic similarity between the words they represent.

Word2vec is not a single algorithm but a combination of two of the techniques mentioned above: CBOW (Continuous Bag of Words) and Skip-Gram.

A detailed explanation of CBOW with code examples is available in a separate article; here we take a deep dive into the Skip-Gram technique.
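To illustrate the cosine similarity mentioned above, here is a small sketch in NumPy. The three-dimensional vectors are made up for illustration and do not come from a trained model; `cosine_similarity` is a hand-written helper, not a library call.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up toy vectors (assumption: real embeddings have far more dimensions).
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "rain":  [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_rain = cosine_similarity(toy_vectors["king"], toy_vectors["rain"])
print(round(sim_king_queen, 3), round(sim_king_rain, 3))
```

Vectors pointing in nearly the same direction score close to 1, while vectors that share little score close to 0, regardless of their lengths.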

## Skip-Gram Model

The continuous skip-gram model learns by predicting the surrounding words given a current word. In other words, the Continuous Skip-Gram Model predicts words within a certain range before and after the current word in the same sentence.

The Continuous Bag of Words model predicts a word from its neighbouring context. The skip-gram model does the reverse: it predicts the context, or neighbouring words, for a given word. The Skip-Gram model is trained on pairs of (target_word, context_word), each labelled 1 or 0. The label specifies whether the context word comes from the same window as the target (1) or was generated randomly (0). Pairs with label 0 serve as negative examples during training.
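The pair generation described above can be sketched in plain Python. This is a simplified illustration, not the Keras implementation used later: the function name `make_skipgram_pairs`, the one-negative-per-positive ratio, and the fixed seed are all assumptions made for the example.

```python
import random

def make_skipgram_pairs(tokens, window_size=2, seed=0):
    """Build (target, context) pairs with label 1 and random pairs with label 0."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    pairs, labels = [], []
    for i, target in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Positive sample: a word that really occurs inside the window.
            pairs.append((target, tokens[j]))
            labels.append(1)
            # Negative sample: a random vocabulary word. It is not checked
            # against the window, so it can occasionally be a true neighbour.
            pairs.append((target, rng.choice(vocab)))
            labels.append(0)
    return pairs, labels

tokens = "the match between india and new-zealand delayed".split()
pairs, labels = make_skipgram_pairs(tokens, window_size=2)
for pair, label in list(zip(pairs, labels))[:6]:
    print(pair, "->", label)
```

With a window of 2, the target `india` is paired with `match`, `between`, `and`, and `new-zealand` as positives, but not with `the`, which is three positions away.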

## Code Implementation of Skip-Gram Model

Steps to be followed:

1. Build the corpus vocabulary
2. Build a skip-gram [(target, context), relevancy] generator
3. Build the skip-gram model architecture
4. Train the Model
5. Get Word Embeddings

##### 1. Build the corpus vocabulary:

An essential step in building any NLP model is to create a vocabulary: we extract each unique word from the corpus and assign it a unique numeric identifier.

In this article, the corpus we are using is ‘The King James Version of the Bible’, from Project Gutenberg, available free through the corpus module in nltk.

Import all dependencies:

```python
import re  # used by normalize_document below
from string import punctuation  # to remove punctuation from corpus

import nltk
import numpy as np
from nltk.corpus import gutenberg  # to get bible corpus
# These import paths assume a Keras 2.x installation.
from keras.preprocessing import text
from keras.preprocessing.sequence import skipgrams
from keras.layers import Dense, Dot, Embedding, Reshape
from keras.models import Model, Sequential
```

```python
nltk.download('gutenberg')
nltk.download('stopwords')  # the stop word list must be downloaded too
stop_words = nltk.corpus.stopwords.words('english')
```

We use a user-defined function for text preprocessing that removes special characters, digits, and stopwords, strips extra whitespace, and lowercases the text.

```python
bible = gutenberg.sents("bible-kjv.txt")
remove_terms = punctuation + '0123456789'
wpt = nltk.WordPunctTokenizer()

def normalize_document(doc):
    # remove special characters and digits; flags must be passed by keyword,
    # otherwise re.sub treats them as the `count` argument
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    # lower case and strip surrounding whitespace
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)
```

Next, we extract the unique words from the corpus and assign each a numeric identifier.

```python
norm_bible = [[word.lower() for word in sent if word not in remove_terms] for sent in bible]
norm_bible = [' '.join(tok_sent) for tok_sent in norm_bible]
norm_bible = filter(None, normalize_corpus(norm_bible))
norm_bible = [tok_sent for tok_sent in norm_bible if len(tok_sent.split()) > 2]

tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(norm_bible)
word2id = tokenizer.word_index
id2word = {v: k for k, v in word2id.items()}
vocab_size = len(word2id) + 1
wids = [[word2id[w] for w in text.text_to_word_sequence(doc)] for doc in norm_bible]
print('Vocabulary Size:', vocab_size)
print('Vocabulary Sample:', list(word2id.items())[:5])
```

Output:

```
Vocabulary Size: 12588
Vocabulary Sample: [('shall', 1), ('unto', 2), ('lord', 3), ('thou', 4), ('thy', 5)]
```

##### 2. Build a Skip-Gram [(target, context), relevancy] generator:

Keras provides the skipgrams utility function, which transforms a sequence of word indices into tuples of words of the form:

(word, word in the same window), with label 1 (positive samples)

(word, random word from the vocabulary), with label 0 (negative samples)

```python
# generate skip-grams
skip_grams = [skipgrams(wid, vocabulary_size=vocab_size, window_size=10) for wid in wids]
# view sample skip-grams
pairs, labels = skip_grams[0][0], skip_grams[0][1]
for i in range(10):
    print("({:s} ({:d}), {:s} ({:d})) -> {:d}".format(
        id2word[pairs[i][0]], pairs[i][0],
        id2word[pairs[i][1]], pairs[i][1],
        labels[i]))
```

Output:

```
(king (13), james (1154)) -> 1
(bible (5766), willing (1559)) -> 0
(james (1154), king (13)) -> 1
(james (1154), bible (5766)) -> 1
(king (13), bible (5766)) -> 1
(bible (5766), supper (2792)) -> 0
(james (1154), moreover (378)) -> 0
(king (13), comforters (4903)) -> 0
(james (1154), nourish (4708)) -> 0
(bible (5766), james (1154)) -> 1
```

##### 3. Build the Skip-Gram model architecture:

Using Keras with the TensorFlow backend, we build a deep learning architecture for skip-gram. Each input is a (target word, context word) pair, so the model has two inputs. Each input is passed to its own embedding layer to obtain embeddings for the target and context words. We then combine the outputs of these two layers and pass the result to a dense layer that predicts 1 or 0, depending on whether the pair of words is contextually relevant or randomly generated.
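To make the scoring step concrete, here is a minimal NumPy sketch of the forward pass for a batch of pairs. It assumes the two embeddings are combined with a dot product and squashed with a sigmoid. The toy sizes, the array names, and `predict_relevance` are illustrative and not part of the Keras code, and the trainable dense layer is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 10, 4  # toy sizes chosen for the example

# One embedding table per input, mirroring the two embedding layers.
target_embeddings = rng.normal(scale=0.1, size=(vocab_size, embed_size))
context_embeddings = rng.normal(scale=0.1, size=(vocab_size, embed_size))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_relevance(target_ids, context_ids):
    """Score each (target, context) pair with sigmoid(dot(target, context))."""
    t = target_embeddings[target_ids]    # shape: (batch, embed_size)
    c = context_embeddings[context_ids]  # shape: (batch, embed_size)
    scores = np.sum(t * c, axis=1)       # row-wise dot product
    return sigmoid(scores)

probs = predict_relevance(np.array([3, 3, 7]), np.array([1, 5, 2]))
print(probs.shape)  # (3,)
```

Training pushes the score towards 1 for pairs labelled 1 and towards 0 for randomly generated pairs, which is what moves related words closer together in the embedding space.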

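```python
# build skip-gram architecture
embed_size = 100

# embedding branch for the target word
word_model = Sequential()
word_model.add(Embedding(vocab_size, embed_size,
                         embeddings_initializer="glorot_uniform",
                         input_length=1))
word_model.add(Reshape((embed_size,)))

# embedding branch for the context word
context_model = Sequential()
context_model.add(Embedding(vocab_size, embed_size,
                            embeddings_initializer="glorot_uniform",
                            input_length=1))
context_model.add(Reshape((embed_size,)))

# NOTE: the merge step is missing from the original listing; a dot product
# of the two embeddings is assumed here.
merged_output = Dot(axes=1)([word_model.output, context_model.output])

model_combined = Sequential()
model_combined.add(Dense(1, kernel_initializer="glorot_uniform", activation="sigmoid"))
final_model = Model([word_model.input, context_model.input], model_combined(merged_output))
final_model.compile(loss="mean_squared_error", optimizer="rmsprop")
final_model.summary()

# visualize model structure (requires pydot and Graphviz)
from IPython.display import SVG
from keras.utils import model_to_dot
SVG(model_to_dot(final_model, show_shapes=True, show_layer_names=False,
                 rankdir='TB').create(prog='dot', format='svg'))
```

##### 4. Train the model: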

Training the model on the complete corpus takes a long time, so we run only a few epochs (the loop below runs two); you can increase the number of epochs if needed.

```python
for epoch in range(1, 3):  # two passes over the corpus
    loss = 0
    for i, elem in enumerate(skip_grams):
        # split the (target, context) pairs of one sentence into two input arrays
        pair_first_elem = np.array(list(zip(*elem[0]))[0], dtype='int32')
        pair_second_elem = np.array(list(zip(*elem[0]))[1], dtype='int32')
        labels = np.array(elem[1], dtype='int32')
        X = [pair_first_elem, pair_second_elem]
        Y = labels
        if i % 10000 == 0:
            print('Processed {} (skip_first, skip_second, relevance) pairs'.format(i))
        loss += final_model.train_on_batch(X, Y)
    print('Epoch:', epoch, 'Loss:', loss)
```

##### 5. Get word embeddings:

To get word embeddings for the entire vocabulary, we extract the weights from the embedding layer of word_model. We then compute the pairwise Euclidean distances between the embeddings and list the five nearest words for a few search terms.

```python
from sklearn.metrics.pairwise import euclidean_distances

word_embed_layer = word_model.layers[0]
# drop row 0, which belongs to the unused index 0
weights = word_embed_layer.get_weights()[0][1:]
distance_matrix = euclidean_distances(weights)
print(distance_matrix.shape)
similar_words = {search_term: [id2word[idx] for idx in distance_matrix[word2id[search_term]-1].argsort()[1:6]+1]
                 for search_term in ['god', 'jesus', 'egypt', 'john', 'famine']}
similar_words
```

Output:

```
(12424, 12424)
{'egypt': ['congregation', 'stood', 'lad', 'officers', 'blood'],
 'famine': ['bank', 'corrupted', 'pit', 'ill', 'burdens'],
 'god': ['clothes', 'house', 'come', 'side', 'came'],
 'jesus': ['baptism', 'otherwise', 'general', 'shortly', 'wanting'],
 'john': ['zebedee', 'disciple', 'soldiers', 'council', 'repenteth']}
```

We can see that some of the neighbours are related to the target word (for example, 'zebedee' and 'disciple' for 'john'), while others are not. The results may improve with more training epochs, at the cost of more computation time.
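The lookup above uses Euclidean distance. Since word2vec similarity is usually described in terms of cosine similarity, here is a sketch of the same nearest-neighbour lookup with cosine similarity on a tiny, made-up weight matrix. The names `toy_word2id`, `toy_weights`, and `nearest_by_cosine` are hypothetical; the sketch also shows the index bookkeeping that comes from word ids starting at 1.

```python
import numpy as np

# Hypothetical toy data: ids start at 1, so row 0 of the weights is unused.
toy_word2id = {"king": 1, "queen": 2, "rain": 3, "storm": 4}
toy_id2word = {i: w for w, i in toy_word2id.items()}
toy_weights = np.array([
    [0.0, 0.0, 0.0],  # row 0: reserved index, no word
    [0.9, 0.8, 0.1],  # king
    [0.8, 0.9, 0.2],  # queen
    [0.1, 0.0, 0.9],  # rain
    [0.2, 0.1, 0.8],  # storm
])

def nearest_by_cosine(term, k=1):
    """Return the k words whose vectors have the highest cosine similarity to `term`."""
    vectors = toy_weights[1:]  # drop row 0; row r now holds word id r + 1
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[toy_word2id[term] - 1]
    order = np.argsort(-sims)  # most similar first
    ids = [int(r) + 1 for r in order]
    return [toy_id2word[i] for i in ids if i != toy_word2id[term]][:k]

print(nearest_by_cosine("king"))  # ['queen']
print(nearest_by_cosine("rain"))  # ['storm']
```

Cosine similarity ignores vector length, so it can rank neighbours differently from Euclidean distance when some embeddings have much larger norms than others.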

## Wevi: Word Embedding Visual Inspector

Wevi is an interactive visual interface that demonstrates the mechanism of the CBOW and Skip-Gram models. It lets the user examine how the input and output vectors move at each epoch, and it can train models on different batches. Using Principal Component Analysis (PCA), it visualizes the high-dimensional vectors in a scatter plot.

After training, users can manually activate an input and trace it through the layers up to the output layer, which gives clear insight into the working mechanism. The user is also free to tune hyperparameters such as the hidden layer size and learning rate.
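The PCA projection mentioned above can be sketched with NumPy alone. This is a generic illustration of projecting embedding vectors onto two dimensions, not Wevi's own code; the function name `pca_project` and the random stand-in embeddings are assumptions made for the example.

```python
import numpy as np

def pca_project(vectors, n_components=2):
    """Project row vectors onto their first principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(8, 100))  # stand-in for trained embeddings
points_2d = pca_project(toy_embeddings)
print(points_2d.shape)  # (8, 2)
```

Each row of `points_2d` can then be drawn as one point in a scatter plot and labelled with its word.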

## Conclusion:

In this article, we learned about word embeddings and their different types, from basic representations to more advanced ones. We then walked through a practical implementation of word2vec using the skip-gram architecture. Lastly, we looked at the user-friendly interface of Wevi, which gives clear insight into how models like CBOW and skip-gram work and lets users track the neurons for the respective words.


Vijaysinh is an enthusiast in machine learning and deep learning. He is skilled in ML algorithms, data manipulation, handling and visualization, model building.
