This 2013 Seminal Paper By Google Researchers Changed NLP Forever. Here’s A Deep Dive

Greg Corrado

The purpose of a natural language is to facilitate the communication of ideas among people. These ideas combine to determine the meaning of an utterance or a text; this meaning is called semantics. Researchers in natural language processing (NLP) and computational linguistics develop theories and approaches to natural language semantics.

One of the milestones of modern NLP practice was the invention of word embeddings. A 2013 paper titled Efficient Estimation of Word Representations in Vector Space, by Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean, introduced techniques for learning high-quality word vectors from huge data sets with billions of words and millions of words in the vocabulary. This was a breakthrough because the paper provided a much-needed alternative to n-gram models, which had reached their limits in many tasks. In domain-heavy tasks such as speech recognition, results are mainly dominated by the amount of high-quality transcribed speech data available, so there were instances where simply scaling up the corpora didn’t enhance performance.

The ideas in the paper were hugely successful and were used to make advances in capturing the semantics of words and the semantic relationships between them. The paper used a distributed, continuous representation of words as opposed to a 1-of-N encoding. Mikolov and his co-authors weren’t the first to use continuous vector representations of words, but they substantially reduced the computational complexity of learning such representations.

Word Vectors

To put it simply, word vectors are just numerical representations of text, and they can take many forms. One of the most common representations is 1-of-N (one-hot) encoding: the encoding of a given word is a vector in which the corresponding element is set to one and all other elements are zero. Broadly, we can classify embeddings into two types:

  1. Frequency-based word embeddings
  2. Prediction-based word embeddings
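To make 1-of-N encoding concrete, here is a minimal sketch over a hypothetical toy vocabulary (the word list and function name are illustrative, not from the paper):

```python
# A minimal sketch of 1-of-N (one-hot) encoding over a toy vocabulary.
vocab = ["quality", "representations", "measured", "words"]

def one_hot(word, vocab):
    """Return the 1-of-N vector for `word`: 1 at its index, 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("measured", vocab))  # -> [0, 0, 1, 0]
```

Note that such vectors are as long as the vocabulary and carry no notion of similarity: every pair of distinct words is equally far apart, which is exactly the limitation that learned, continuous word vectors address.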

Learning word vectors

Efficient Estimation of Word Representations in Vector Space proposed two new architectures: A Continuous Bag-of-Words model, and a Continuous Skip-gram model.

  • Continuous bag-of-words (CBOW) Model

Consider this line:

“We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks.”

Now, picking any chunk from the above line, select a focus word and the words around it as context.

In this case, for example:

[the quality of these]   [representations]   [is measured in a]
       Context              Focus word             Context

In the CBOW model the context words form the input to the neural network. If the size of the vocabulary is N, each input is represented in a 1-of-N encoding, with only one element set to 1 and all others set to zero. Besides this input layer, the model has a hidden layer and an output layer. Refer to the following diagram:

CBOW Model
An illustration of CBOW model

The objective is to maximise the conditional probability of observing the actual output word (the focus word) given the input context words, with regard to the weights. In our example, given the input (“the”, “quality”, “of”, “these”, “is”, “measured”, “in”, “a”) we want to maximise the probability of getting “representations” as the output.

Remember, our inputs are one-hot encoded, so multiplying one with the weight matrix simply selects a row of that matrix.

Therefore, after the C input word vectors are passed in, the hidden layer applies a linear activation and passes the combined (averaged) input rows on to the output layer. At the output layer, the error between the target and the output is calculated and back-propagated to update the weights.
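The forward pass described above can be sketched in a few lines of plain Python. This is a toy illustration with made-up dimensions and random, untrained weights, not the paper's implementation:

```python
import math
import random

random.seed(0)
vocab = ["the", "quality", "of", "these", "representations", "is", "measured", "in", "a"]
N, D = len(vocab), 4  # vocabulary size, embedding dimension (toy values)

# W1 maps 1-of-N inputs to the hidden layer; W2 maps hidden activations to output scores.
W1 = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(N)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(D)]

def cbow_forward(context_words):
    # Multiplying a one-hot vector by W1 just selects a row, so the hidden
    # layer is the (linear) average of the context words' rows.
    rows = [W1[vocab.index(w)] for w in context_words]
    hidden = [sum(r[d] for r in rows) / len(rows) for d in range(D)]
    # Output layer: a score for every vocabulary word, normalised by softmax.
    scores = [sum(hidden[d] * W2[d][j] for d in range(D)) for j in range(N)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]  # P(word | context)

probs = cbow_forward(["the", "quality", "of", "these", "is", "measured", "in", "a"])
print(vocab[probs.index(max(probs))])  # weights are untrained, so this is an arbitrary word
```

Training would compare `probs` against the one-hot target for “representations” and back-propagate the error into W1 and W2, as the text describes.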

  • The Skip Gram Model

The skip-gram model is the mirror image of the CBOW model: the focus word is the input vector, and the goal is to predict the context words.

Skip Gram model
An illustration of a Skip Gram model

The hidden-layer activation is again linear: it simply selects the corresponding row of the weight matrix W1, as in the CBOW approach.

At the output layer, we obtain C multinomial distributions, one per context position, so the output covers multiple words. In our example, the input would be “representations”, and the correct answers at the output layer would be (“the”, “quality”, “of”, “these”, “is”, “measured”, “in”, “a”). An element-wise sum is taken over all the error vectors to obtain a final error vector, which is back-propagated to update the weights of the shallow network.
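The skip-gram error computation described above can be sketched as follows; again, the dimensions and random weights are illustrative toy values, not the paper's setup:

```python
import math
import random

random.seed(1)
vocab = ["the", "quality", "of", "these", "representations", "is", "measured", "in", "a"]
N, D = len(vocab), 4  # vocabulary size, embedding dimension (toy values)

W1 = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(N)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(D)]

def skipgram_error(focus, context_words):
    hidden = W1[vocab.index(focus)]  # linear: select the row for the focus word
    scores = [sum(hidden[d] * W2[d][j] for d in range(D)) for j in range(N)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]  # one softmax shared across context positions
    # One error vector (prediction minus one-hot target) per context position,
    # summed element-wise into a single final error vector.
    error = [0.0] * N
    for w in context_words:
        target_idx = vocab.index(w)
        for j in range(N):
            error[j] += probs[j] - (1.0 if j == target_idx else 0.0)
    return error

err = skipgram_error("representations",
                     ["the", "quality", "of", "these", "is", "measured", "in", "a"])
```

Each context position contributes a `probs - target` term, and because both sum to one, the components of the final error vector always sum to zero; the vector is then back-propagated through W2 and W1.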

Applications Of Word Embeddings:

  1. Words are embedded in a real vector space, which makes it easy to measure the distance between them. This helps in quantifying the relationships between words and sentences.
  2. Since word embeddings provide a powerful way to create vector representations, many recommendation systems are based on them. Spotify uses them to recommend music and Stitch Fix uses them to recommend clothing.
  3. Word vectors support addition and subtraction. These operations allow us to use them in machine translation and sentiment analysis, among other applications.
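The arithmetic in the third point can be illustrated with the classic king − man + woman analogy. The 3-dimensional vectors below are hand-made for illustration, not learned from data; real embeddings have hundreds of dimensions:

```python
# Toy sketch of vector arithmetic on word embeddings.
emb = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# king - man + woman: find the nearest word (by cosine similarity) to the result.
query = add(sub(emb["king"], emb["man"]), emb["woman"])
best = max((w for w in emb if w != "king"), key=lambda w: cosine(emb[w], query))
print(best)  # -> queen
```

The query vector here is exactly the “queen” vector, so the nearest neighbour comes out as expected; with real learned embeddings the match is approximate rather than exact.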

Beyond the examples mentioned above, many NLP and non-NLP applications have used the techniques proposed in the paper, revolutionising numerous internet services and products.


Abhijeet Katte
As a thorough data geek, most of Abhijeet's day is spent in building and writing about intelligent systems. He also has deep interests in philosophy, economics and literature.
