Guide To Distributed Representations in ML

Distributed Representations (DR) play a significant role in machine learning. DR is a principled way of representing entities (say, cats or dogs) in terms of vectors. Entities sharing common properties have vector representations that are nearer to each other.

Numeric Representations and the Role of DR

The inputs and outputs of Machine Learning (ML) models are often numeric, so text must first be given a suitable numeric representation. Suppose the following sentences are used to train an ML model.

  • He is a King.
  • King is a man.
  • Queen is a woman.
  • She is a Queen.
  • King and Queen are rulers.

For the words to be fed as input to the model, they need a mathematical representation. One-hot encoding of words as vectors is one way to get this representation. The collection of unique words across all the sentences is referred to as the vocabulary, and the dimension of each one-hot vector equals the size of the vocabulary. If the vocabulary of the above sentences considers only nouns and pronouns, King is represented as [0 1 0 0 0 0 0] and Queen as [0 0 0 1 0 0 0]. This is an example of a local representation of words. However, this representation is not very expressive, as it captures no information about similar words. For example, it does not reflect that King and Queen are both rulers.
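As a minimal sketch of the one-hot scheme described above, the snippet below builds the vectors in plain Python. The vocabulary order and the `one_hot` helper are assumptions chosen to reproduce the vectors quoted in the text; they do not come from the original article.

```python
# Vocabulary of nouns and pronouns from the five sentences.
# The ordering is an assumption chosen to match the vectors in the text.
vocabulary = ["He", "King", "man", "Queen", "woman", "She", "rulers"]

def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's index in vocab."""
    return [1 if w == word else 0 for w in vocab]

king = one_hot("King", vocabulary)    # [0, 1, 0, 0, 0, 0, 0]
queen = one_hot("Queen", vocabulary)  # [0, 0, 0, 1, 0, 0, 0]

# The dot product of any two distinct one-hot vectors is 0,
# so this encoding says nothing about how similar two words are.
similarity = sum(a * b for a, b in zip(king, queen))
print(king, queen, similarity)
```

Every pair of distinct words gets the same similarity of zero, which is the limitation that distributed representations address.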

The number of unique words increases with the size of the input data, which requires longer vectors to represent each word. Moreover, one-hot vectors do not capture the semantic similarity of words. DR overcomes these issues by representing similar words with vectors that lie nearer to each other.

Building DR with an Example

Let us imagine that we want to express the words “Man”, “Woman”, “King”, “Queen” and “Ruler” using 2-D vectors such that they preserve the following semantics:

  • King - Man + Woman → Queen
  • King - Man → Ruler
  • Ruler + Woman → Queen

Note that each word here denotes its vector, conventionally written with an overhead arrow. For example, King is the vector representation of the word “King”. If the rules of vector arithmetic are to hold, one way to choose vectors satisfying the above rules is shown below.

Man = [0,1], Woman = [2,1]

King = [1,1], Queen = [3,1]

Ruler = [1,0]
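A quick way to confirm that these vectors satisfy all three rules is to carry out the arithmetic. The sketch below does so in plain Python; the `add` and `sub` helpers are illustrative names, not part of any library.

```python
# 2-D vectors from the text
vectors = {
    "Man": (0, 1), "Woman": (2, 1),
    "King": (1, 1), "Queen": (3, 1),
    "Ruler": (1, 0),
}

def add(u, w):
    """Element-wise sum of two vectors."""
    return tuple(a + b for a, b in zip(u, w))

def sub(u, w):
    """Element-wise difference of two vectors."""
    return tuple(a - b for a, b in zip(u, w))

v = vectors
rule_1 = add(sub(v["King"], v["Man"]), v["Woman"]) == v["Queen"]
rule_2 = sub(v["King"], v["Man"]) == v["Ruler"]
rule_3 = add(v["Ruler"], v["Woman"]) == v["Queen"]
print(rule_1, rule_2, rule_3)
```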

The vector representation of the words above can be visualized in a two-dimensional vector space as shown below.

However, the example taken here is very simple. In real-world scenarios, there could be thousands of words to deal with and hundreds of thousands of contexts by which multiple words could be related to each other. In such cases, assigning appropriate vector representations to these words manually would be cumbersome or even infeasible. So, we need a generalized way to perform the task.

The theory of deep learning has produced beautiful results in this respect. Before diving into the details of achieving DR using deep learning, we must take a look at a simple implementation of the popular Word2Vec model.

Building DR using Word2Vec Model

Let us start with a small text corpus which is a collection of the following sentences.

  • We went to the beach on a sunny day.
  • There were many tourists on the beach.
  • We went to the nearby museum with other tourists.

Our objective is to automate the derivation of the word vectors. We use the above sentences to learn the semantic similarity of words. As we are interested in word vectors, we start with tokenizing the sentences.

sentences = [
    ['We', 'went', 'to', 'the', 'beach', 'on', 'a', 'sunny', 'day'],
    ['There', 'were', 'many', 'tourists', 'on', 'the', 'beach'],
    ['We', 'went', 'to', 'the', 'nearby', 'museum', 'with', 'other', 'tourists']]
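For a corpus this small, the token lists can also be produced programmatically. The sketch below assumes that splitting on whitespace and dropping the trailing full stop is enough; real text usually needs a proper tokenizer.

```python
corpus = [
    "We went to the beach on a sunny day.",
    "There were many tourists on the beach.",
    "We went to the nearby museum with other tourists.",
]

# Drop the trailing full stop, then split each sentence on whitespace
sentences = [line.rstrip(".").split() for line in corpus]
print(sentences[1])
```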

We have seen that one-hot encoding of words is an inefficient way of vectorial representation. A better way is to use Word Embeddings. Embeddings provide dense representations where similar words are identifiable and also reduce the dimensionality of the vector. A Machine Learning (ML) model is trained to learn the values of the embeddings.

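from gensim.models import Word2Vec
from sklearn.decomposition import PCA
# Train Word2Vec on the tokenized sentences; min_count=1 keeps words that occur only once
model = Word2Vec(sentences, min_count=1)
# gensim 4.x API; older releases used model[model.wv.vocab]
X = model.wv[model.wv.index_to_key]
# Project the learned embeddings down to two dimensions
pca = PCA(n_components=2)
result = pca.fit_transform(X)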

The vector representations can be plotted in the 2-D vector space as follows.

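from matplotlib import pyplot
words = model.wv.index_to_key  # gensim 4.x: vocabulary in the same order as the vectors
pyplot.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    a, b = result[i, 0], result[i, 1]
    pyplot.annotate(word, xy=(a, b))  # label each point with its word
pyplot.show()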

Various neural network models have been designed to build word embeddings. Continuous Bag of Words (CBOW) and Skip Gram are two such examples. The CBOW model is a popular neural network implementation for arriving at the distributed representation of words. In the CBOW model, a word is predicted from its context words, whereas the Skip Gram model attempts to predict the context words from a given word. When implemented using a neural network, the input layer of the CBOW model takes multiple context words as input and the output is a single word. The Skip Gram model, in contrast, takes a single input word, and its output layer comprises the multiple context words corresponding to that input.
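The difference between the two models is easiest to see in the training examples each one consumes. The following sketch generates those examples from a token list; the function names and the `window` parameter are illustrative assumptions, and the snippet only prepares the pairs, it does not train a network.

```python
def skip_gram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(context words, center) pairs: the neighbours jointly predict the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

tokens = ["We", "went", "to", "the", "beach"]
print(cbow_pairs(tokens, window=1)[1])
print(skip_gram_pairs(tokens, window=1)[:3])
```

In gensim, the `sg` parameter of `Word2Vec` switches between the two: `sg=0` (the default) selects CBOW and `sg=1` selects Skip Gram.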

DR using DL

Generally, we can consider a Deep Neural Network (DNN) to be composed of an input layer that takes the input vectors, hidden layers (often seen as a black box) and an output layer that produces the output vector. The weights in each hidden layer serve as a compact representation of the input vectors, which in most cases is understood only by the neural network. These learned representations can be used to transform an input vector into a lower-dimensional vector.

If we input words to a DNN and decide on a loss function, then the DNN can be trained by the backpropagation algorithm. The output of such a DNN can be used as the distributed representations of the words.
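To make the link between a network's weights and word vectors concrete, the sketch below shows that multiplying a one-hot input by a weight matrix simply selects one row of that matrix. The vocabulary and weight values are invented for illustration; in practice the weights are learned by backpropagation.

```python
# A toy weight matrix: one row per vocabulary word, one column per hidden unit.
# The values are invented for illustration; training would learn them.
vocabulary = ["Man", "Woman", "King", "Queen"]
weights = [
    [0.1, 0.9],
    [0.8, 0.9],
    [0.2, 0.1],
    [0.9, 0.2],
]

def one_hot(word):
    """One-hot row vector for a word in the toy vocabulary."""
    return [1.0 if w == word else 0.0 for w in vocabulary]

def hidden_activation(x, W):
    """Compute the row-vector product x @ W without an external library."""
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

dense_vector = hidden_activation(one_hot("King"), weights)
print(dense_vector)  # identical to weights[2], the row for "King"
```

This is why an embedding layer can be implemented as a table lookup: the row selected for a word is its distributed representation.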

To understand this better, let us perform a task of binary classification (positive, negative) for sentiment analysis of IMDB movie reviews in the Large Movie Review Dataset, whose training set has 25,000 labelled movie reviews.

  1. Find the data here.
  2. We are now ready to write the Python code to implement the sentiment analysis classifier which uses the idea of embeddings. Add the following library imports.

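import io
import os
import re
import shutil
import string
import tensorflow as tf
from tensorflow.keras import Sequential
# The source lists a third layer name after Embedding, but it is truncated
from tensorflow.keras.layers import Dense, Embedding
# Newer TensorFlow releases expose this as tensorflow.keras.layers.TextVectorization
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization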

  3. Each data point is a movie review which is classified as a positive or a negative sentiment. We first need to download the data by executing:

url = ...  # download URL of aclImdb_v1.tar.gz (omitted in the source)

dataset = tf.keras.utils.get_file(
    "aclImdb_v1.tar.gz", url, untar=True, cache_dir='.', cache_subdir='')
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
train_dir = os.path.join(dataset_dir, 'train')
# The 'unsup' folder holds unlabelled reviews, which this task does not use
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)

  4. Our data is split into training and validation sets using the keras.preprocessing module by executing:

batch_size = 1024
seed = 123
# Hold out 20% of the training reviews for validation; the shared seed keeps the two subsets disjoint
train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size,
    validation_split=0.2, subset='training', seed=seed)
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size,
    validation_split=0.2, subset='validation', seed=seed)

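  5. Often, the text data that we use directly from a dataset available online can have undesired components like HTML tags, punctuation, etc. So, we need to do some preprocessing to clean the text data. This is taken care of while initializing the TextVectorization layer, which also creates the vocabulary used while training the model. We write a custom standardization function to use in the TextVectorization layer, so that it can clean the text data according to our needs.

def custom_standardization(input_data):
    # Lowercase, drop HTML line breaks, then strip punctuation
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(
        stripped_html, '[%s]' % re.escape(string.punctuation), '')

vocab_size = 10000
sequence_length = 50
# Only output_sequence_length survives in the source; the other
# arguments are assumptions based on the surrounding text
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)
# Build the vocabulary from the review text only (labels dropped)
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)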
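For intuition about what the standardization and vectorization steps do, here is a plain-Python approximation. It is not the TextVectorization implementation: the helper names and the tiny vocabulary are hypothetical, and index 0 is reserved for padding and index 1 for unknown words, mirroring the convention that layer uses.

```python
import re
import string

def standardize(text):
    """Lowercase, replace HTML line breaks with spaces, strip punctuation."""
    text = text.lower().replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

def vectorize(text, vocab, sequence_length):
    """Map tokens to integer ids, then pad or truncate to a fixed length."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]  # 1 = unknown
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))  # 0 = padding

# Hypothetical vocabulary: slot 0 is padding, slot 1 is the unknown token
vocab = ["", "[UNK]", "the", "movie", "was", "great"]
print(standardize("Great movie!<br />Loved it."))
print(vectorize("The movie was GREAT!", vocab, sequence_length=6))
```

Every review becomes a fixed-length list of integers, which is the form the embedding layer expects.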

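  6. The task that we have chosen to get DR is sentiment classification, so a neural network is built as the sentiment classification model. We use Adam as the optimization algorithm, Binary Cross-Entropy as the loss, and accuracy as the performance metric to train the model.

embedding_dim = 16  # each word gets a 16-dimensional embedding
model = Sequential([vectorize_layer,
    Embedding(vocab_size, embedding_dim, name="embedding"),
    # Assumed head (truncated in the source): pool the word vectors, then classify
    tf.keras.layers.GlobalAveragePooling1D(), Dense(16, activation="relu"), Dense(1)])
model.compile(optimizer="adam", metrics=["accuracy"],
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))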
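To show how an embedding feeds a classifier and how the Binary Cross-Entropy loss is computed, the sketch below walks one toy review through a forward pass in plain Python. Everything in it is an assumption made for illustration: the embedding values, the single output unit standing in for the dense layers, and the use of average pooling to combine word vectors.

```python
import math

def average_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sigmoid(z):
    """Squash a logit into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(label, prob):
    """Loss for one example; label is 0 or 1, prob is the predicted P(label = 1)."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

# Toy 2-dimensional embeddings (invented values) for a 4-word vocabulary
embeddings = [[0.0, 0.0], [1.0, 3.0], [3.0, 1.0], [-2.0, -4.0]]
review_ids = [1, 2]  # a "review" made of words 1 and 2
pooled = average_pool([embeddings[i] for i in review_ids])

# A single output unit with invented weights stands in for the dense layers
w, b = [0.5, -0.5], 0.0
logit = sum(wi * xi for wi, xi in zip(w, pooled)) + b
prob = sigmoid(logit)
loss = binary_cross_entropy(1, prob)
print(pooled, prob, round(loss, 4))
```

During training, backpropagation adjusts the embedding rows along with the other weights to reduce this loss, which is how the word vectors come to reflect sentiment.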

  7. Now, we use the training and validation data sets to train the model. The number of epochs used here is 15, as the focus is on explaining the workings of the code. However, a higher number of epochs might lead to better accuracy. We can check this by executing:

model.fit(train_ds, validation_data=val_ds, epochs=15,
          callbacks=[])  # the callback list is truncated in the source; an empty list is a placeholder

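  8. Now that we have trained our model to a certain level of accuracy, we can retrieve the word embeddings from our Google Colab project. These are the new DR of the words in the vocabulary.

# Embedding matrix: one row per vocabulary entry
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()
# The output file names are truncated in the source; these two are assumed
out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')
for index, word in enumerate(vocab):
    if index == 0:
        continue  # index 0 is reserved for padding
    vec = weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")
out_v.close()
out_m.close()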

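  9. We can download the embeddings from Google Colab to our local disk by executing:

try:
    from google.colab import files
    for name in ('vectors.tsv', 'metadata.tsv'):  # assumed file names
        files.download(name)
except Exception:
    pass  # not running in Colab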

  10. We have now got the DR of words in the vocabulary. We can visualize them by uploading the embedding files downloaded above here.
  11. We have chosen the vocabulary size to be 10,000, and the embedding for each word in the vocabulary is 16-dimensional. This means that the embedding layer in the model above has 10,000 × 16 = 160,000 parameters. This can be confirmed, and the other details of the model viewed, by printing the Keras model summary with model.summary().
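The parameter count quoted above is simple arithmetic: the embedding layer stores one weight per vocabulary word per embedding dimension. A quick check, using the two sizes stated in the text:

```python
vocab_size = 10_000   # words in the vocabulary
embedding_dim = 16    # length of each word vector

# One row of embedding_dim weights for every word in the vocabulary
embedding_params = vocab_size * embedding_dim
print(embedding_params)  # 160000
```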


We have shared the code used in this article here.

Applications of DR

We saw an application of embeddings in the sentiment analysis task. Another fascinating example of distributed representation is that, when words are trained together with paragraphs, we can discover the equivalence between two similar, yet different, entities. Suppose you come across a paragraph about Bill Gates; ‘Microsoft’ would be an obvious word in that paragraph. Now check out what distributed representation can do:

ParagraphVector(“Bill Gates”)-



→ ParagraphVector(“Steve Jobs”)
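Analogy queries of this kind are typically answered by computing the target vector and then finding the stored vector closest to it. The sketch below illustrates the idea with the toy 2-D word vectors from the earlier King and Queen example rather than real paragraph vectors; the `analogy` helper and the choice of cosine similarity are assumptions made for illustration.

```python
import math

# Toy 2-D word vectors from the earlier example
vectors = {
    "Man": (0, 1), "Woman": (2, 1),
    "King": (1, 1), "Queen": (3, 1),
    "Ruler": (1, 0),
}

def cosine(u, v):
    """Cosine similarity between two non-zero 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word whose vector is closest to a - b + c, excluding a, b and c."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("King", "Man", "Woman"))
```

With learned embeddings the target rarely coincides exactly with a stored vector, which is why a nearest-neighbour search is used instead of an equality test.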

Such exciting results find several applications in the real world. We see embeddings used in almost every domain such as legal text understanding, clinical healthcare or even software engineering.


Anisha Saha is a graduate student at the Chennai Mathematical Institute. Her interests are in Data Science and Machine Learning.
Chandrashish Prasad is a graduate student at the Chennai Mathematical Institute. His interests are in Software Engineering and Deep Learning.
Venkatesh Vinayakarao is a lecturer in the Department of Computer Science at the Chennai Mathematical Institute. His interests are in Information Retrieval, Program Analysis and Software Engineering.
