Whenever you start typing on your mobile phone, writing an email or searching for something on Google, you have probably noticed that the next word is suggested automatically after you type the first few. Text processing is at the heart of these and many similar applications, and all of them deal with enormous amounts of text. So how do we make today’s machines perform clustering or classification on text data, which they cannot handle in its raw form? We create a representation of words that captures their meanings, semantic relationships, and the different contexts they appear in. This is implemented using word embeddings, numerical representations of text that a computer can work with.
Word Embeddings
In natural language processing, word embedding is a term used to represent words for text analysis, typically as real-valued vectors that encode a word’s meaning. Words that are close to each other in the vector space are expected to have similar meanings. Word embedding uses language modelling and feature learning techniques in which words from the vocabulary are mapped to vectors of real numbers.
Let’s take an example: text = “The match between India and New-Zealand delayed due to rain”
From the above text, we can form the dictionary of unique words as follows.

['The', 'match', 'between', 'India', 'and', 'New-Zealand', 'delayed', 'due', 'to', 'rain'] and the vector representation of each word can be a one-hot encoded vector. The vector representation of the word 'India' according to the above vocabulary is [0,0,0,1,0,0,0,0,0,0].
If we try to visualize these vectors, we can think of a 10-dimensional space where each word occupies one dimension and has no relation to any other vector; in a one-hot encoded representation, all the words are independent of each other. This is where the idea of distributed representations comes in: introducing some dependence of one word on another.
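As a quick illustration, here is a minimal one-hot encoding sketch in plain Python for the example sentence; the variable names are ours and not part of any library.

# one-hot encoding sketch for the example sentence (illustrative only)
vocab = ['The', 'match', 'between', 'India', 'and', 'New-Zealand', 'delayed', 'due', 'to', 'rain']
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # a 10-dimensional vector with a single 1 at the word's position
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot('India'))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]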
There are different word embedding techniques, such as Count Vectorizer, TF-IDF Vectorizer, Continuous Bag of Words (CBOW) and Skip-Gram. Details of the Count Vectorizer and TF-IDF Vectorizer can be found here, where classification tasks are carried out.
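For reference, a minimal sketch of the count-based approaches using scikit-learn (a library not used in the rest of this article) could look like this; the toy corpus is our own.

# count-based vectorization sketch using scikit-learn (illustrative only)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the match between india and new zealand delayed due to rain",
          "rain delayed the match"]

count_vec = CountVectorizer()
print(count_vec.fit_transform(corpus).toarray())   # raw term counts per document
print(count_vec.get_feature_names_out())           # the learned vocabulary

tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(corpus).toarray())   # tf-idf weighted values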
In this article, we mainly focus on the Word2Vec technique of word embedding.
Word2vec
Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text.
Word2vec is a two-layer neural network that processes text by “vectorizing” words. Its input is a text corpus, and its output is a set of feature vectors that represent the words in that corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen so that a simple mathematical function, the cosine similarity between the vectors, indicates the level of semantic similarity between the words they represent.
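As a reminder of what that similarity function computes, here is a minimal NumPy sketch; the two vectors are made-up values, not real word2vec embeddings.

# cosine similarity between two word vectors (made-up values for illustration)
import numpy as np

def cosine_sim(a, b):
    # 1 means same direction, 0 means orthogonal, -1 means opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v_king = np.array([0.5, 0.8, 0.1])
v_queen = np.array([0.45, 0.75, 0.2])
print(cosine_sim(v_king, v_queen))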
Word2vec is not a single algorithm but a combination of the two techniques mentioned above, i.e. CBOW (Continuous Bag of Words) and Skip-Gram.
A detailed explanation of CBOW with code examples can be found here; we will take a deep dive into the Skip-Gram technique.
Skip-Gram Model
The continuous skip-gram model learns by predicting the surrounding words given a current word. In other words, the Continuous Skip-Gram Model predicts words within a certain range before and after the current word in the same sentence.
While the Continuous Bag of Words model predicts a word given its neighbouring context, the skip-gram model does the opposite: it predicts the context or neighbouring words for a given word. The Skip-Gram model is trained on pairs of (target_word, context_word), each labelled 1 or 0. The label specifies whether the context_word actually comes from the same window as the target word or was sampled randomly from the vocabulary; the randomly sampled pairs, labelled 0, serve as negative examples.
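To make the (target_word, context_word) idea concrete, here is a minimal sketch, with a toy sentence and window size of our own choosing, that lists the positive pairs a skip-gram window produces.

# positive (target, context) pairs for a toy sentence, window size 2
sentence = "the match between india and new zealand".split()
window_size = 2

pairs = []
for i, target in enumerate(sentence):
    start = max(0, i - window_size)
    end = min(len(sentence), i + window_size + 1)
    for j in range(start, end):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:5])  # [('the', 'match'), ('the', 'between'), ('match', 'the'), ...]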
Code Implementation of Skip-Gram Model
Steps to be followed:
- Build the corpus vocabulary
- Build a skip-gram [(target, context), relevancy] generator
- Build the skip-gram model architecture
- Train the Model
- Get Word Embeddings
1. Build the corpus vocabulary:
The first essential step in building any NLP model is to create a corpus vocabulary: we extract each unique word from the corpus and assign a unique numeric identifier to it.
In this article, the corpus we are using is ‘The King James Version of the Bible’ from Project Gutenberg, available for free through the corpus module in nltk.
Import all dependencies:
import re  # used in text preprocessing below
import nltk
import numpy as np
from nltk.corpus import gutenberg            # to get the Bible corpus
from string import punctuation               # to remove punctuation from the corpus
from keras.preprocessing import text
from keras.preprocessing.sequence import skipgrams
from keras.layers import *
from keras.layers.core import Dense, Reshape
from keras.layers.embeddings import Embedding
from keras.models import Model, Sequential
Download the gutenberg corpus, the punkt model and the stopwords list from nltk as shown below:
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
We use a user-defined function for text preprocessing that removes extra whitespace, digits and stopwords, and lower-cases the text corpus.
bible = gutenberg.sents("bible-kjv.txt")
remove_terms = punctuation + '0123456789'
wpt = nltk.WordPunctTokenizer()

def normalize_document(doc):
    # lower case and remove special characters/whitespaces
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)
Next, we extract the unique words from the corpus and assign each a numeric identifier.
norm_bible = [[word.lower() for word in sent if word not in remove_terms] for sent in bible]
norm_bible = [' '.join(tok_sent) for tok_sent in norm_bible]
norm_bible = filter(None, normalize_corpus(norm_bible))
norm_bible = [tok_sent for tok_sent in norm_bible if len(tok_sent.split()) > 2]

tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(norm_bible)
word2id = tokenizer.word_index
id2word = {v: k for k, v in word2id.items()}

vocab_size = len(word2id) + 1
wids = [[word2id[w] for w in text.text_to_word_sequence(doc)] for doc in norm_bible]

print('Vocabulary Size:', vocab_size)
print('Vocabulary Sample:', list(word2id.items())[:5])
Output:
Vocabulary Size: 12588
Vocabulary Sample: [('shall', 1), ('unto', 2), ('lord', 3), ('thou', 4), ('thy', 5)]
2. Build a Skip-Gram [(target, context), relevancy] generator:
The skipgrams utility from keras.preprocessing.sequence turns a sequence of word indices into tuples of words of the form:
(word, word in the same window), with label 1 (positive samples)
(word, random word from the vocabulary), with label 0 (negative samples)
# generate skip-grams
skip_grams = [skipgrams(wid, vocabulary_size=vocab_size, window_size=10) for wid in wids]

# view sample skip-grams
pairs, labels = skip_grams[0][0], skip_grams[0][1]
for i in range(10):
    print("({:s} ({:d}), {:s} ({:d})) -> {:d}".format(
        id2word[pairs[i][0]], pairs[i][0],
        id2word[pairs[i][1]], pairs[i][1],
        labels[i]))
Output:
(king (13), james (1154)) -> 1
(bible (5766), willing (1559)) -> 0
(james (1154), king (13)) -> 1
(james (1154), bible (5766)) -> 1
(king (13), bible (5766)) -> 1
(bible (5766), supper (2792)) -> 0
(james (1154), moreover (378)) -> 0
(king (13), comforters (4903)) -> 0
(james (1154), nourish (4708)) -> 0
(bible (5766), james (1154)) -> 1
3. Build the Skip-Gram model architecture:
Using Keras with a TensorFlow backend, we will build the deep learning architecture of the skip-gram model. Our input is a (target word, context word) pair, which means we need to process two inputs. Each input is passed to a separate embedding layer to get word embeddings for the target and context words. Afterwards, we merge these two layers and pass the result to a dense layer that predicts either 1 or 0 depending on whether the pair of words is contextually relevant or just randomly generated.
# build skip-gram architecture
embed_size = 100

word_model = Sequential()
word_model.add(Embedding(vocab_size, embed_size,
                         embeddings_initializer="glorot_uniform",
                         input_length=1))
word_model.add(Reshape((embed_size,)))

context_model = Sequential()
context_model.add(Embedding(vocab_size, embed_size,
                            embeddings_initializer="glorot_uniform",
                            input_length=1))
context_model.add(Reshape((embed_size,)))

merged_output = add([word_model.output, context_model.output])

model_combined = Sequential()
model_combined.add(Dense(1, kernel_initializer="glorot_uniform", activation="sigmoid"))

final_model = Model([word_model.input, context_model.input], model_combined(merged_output))
final_model.compile(loss="mean_squared_error", optimizer="rmsprop")

final_model.summary()

# visualize model structure
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(final_model, show_shapes=True, show_layer_names=False,
                 rankdir='TB').create(prog='dot', format='svg'))
Summary of model:
4. Train the model:
Training the model on the complete corpus takes a long time; hence, we train it for just two epochs here. You can increase the number of epochs if needed.
for epoch in range(1, 3):
    loss = 0
    for i, elem in enumerate(skip_grams):
        pair_first_elem = np.array(list(zip(*elem[0]))[0], dtype='int32')
        pair_second_elem = np.array(list(zip(*elem[0]))[1], dtype='int32')
        labels = np.array(elem[1], dtype='int32')
        X = [pair_first_elem, pair_second_elem]
        Y = labels
        if i % 10000 == 0:
            print('Processed {} (skip_first, skip_second, relevance) pairs'.format(i))
        loss += final_model.train_on_batch(X, Y)
    print('Epoch:', epoch, 'Loss:', loss)
5. Get word embeddings:
To get word embeddings for our entire vocabulary, we can extract them from the embedding layer; here, we take the weights of the embedding layer of word_model.
from sklearn.metrics.pairwise import euclidean_distances

# take the weights of the word embedding layer, dropping the padding row (index 0)
word_embed_layer = word_model.layers[0]
weights = word_embed_layer.get_weights()[0][1:]

distance_matrix = euclidean_distances(weights)
print(distance_matrix.shape)

# for each search term, pick the five closest words (skipping the word itself);
# the -1 / +1 shifts convert between matrix rows and word ids
similar_words = {search_term: [id2word[idx]
                               for idx in distance_matrix[word2id[search_term] - 1].argsort()[1:6] + 1]
                 for search_term in ['god', 'jesus', 'egypt', 'john', 'famine']}
similar_words
Output:
(12424, 12424)

{'egypt': ['congregation', 'stood', 'lad', 'officers', 'blood'],
 'famine': ['bank', 'corrupted', 'pit', 'ill', 'burdens'],
 'god': ['clothes', 'house', 'come', 'side', 'came'],
 'jesus': ['baptism', 'otherwise', 'general', 'shortly', 'wanting'],
 'john': ['zebedee', 'disciple', 'soldiers', 'council', 'repenteth']}
We can see that the model returns reasonably related words for each target word. Accuracy can be improved by training for more epochs, though this adds computational time.
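As an optional extension, not part of the original walkthrough, we could also compare two learned vectors directly with cosine similarity, reusing the weights and word2id objects extracted above.

# cosine similarity between two learned word vectors (optional extension)
from sklearn.metrics.pairwise import cosine_similarity

def word_similarity(word1, word2):
    # indices are shifted by 1 because row 0 (padding) was dropped from `weights`
    v1 = weights[word2id[word1] - 1].reshape(1, -1)
    v2 = weights[word2id[word2] - 1].reshape(1, -1)
    return cosine_similarity(v1, v2)[0][0]

print(word_similarity('god', 'lord'))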
Wevi: Word Embedding Visual Inspector
Wevi is an interactive visual interface that demonstrates the mechanism of the CBOW and Skip-Gram models. It allows the user to examine the movement of the input and output vectors at each epoch, and it also provides facilities to train models on different batches. With the help of Principal Component Analysis (PCA), it visualizes the high-dimensional vectors in a scatter plot.
After training, users can manually activate an input and inspect and trace the signal up to the output layer, which gives clear insight into the working mechanism. The user is also free to tune hyperparameters such as the hidden layer size and the learning rate.
Visit http://bit.ly/wevi-online to learn more about the working mechanism of CBOW and skip-gram.
Conclusion:
This article taught us about word embeddings and their different types, from basic representations to advanced ones. Later, we saw a practical implementation of word2vec using the skip-gram architecture. Lastly, we looked at the user-friendly interface of wevi, which gives clear insight into how models like CBOW and skip-gram work and lets users track the neurons for the respective words.