NLP case study: Identify Documents Similarity


Comparison between things, like clothes, food, products and even people, is an integral part of our everyday life. It is done by assessing the similarity (or differences) between two or more things. Apart from its usual role as an aid in selecting a product, comparison is useful for searching for things ‘similar’ to what you have and for classifying things based on similarity. This post describes a specific use case: finding the similarity between two documents.

Measuring Similarity

A measure of similarity can be qualitative and/or quantitative. In a qualitative assessment, documents are judged against subjective criteria such as theme, sentiment and overall meaning. In a quantitative assessment, numerical parameters such as document length, number of keywords and common words are compared. The process is carried out in the two steps below:



  • Vectorization: Transform each document into a vector of numbers. Some of the popular measures are TF (Term Frequency), IDF (Inverse Document Frequency) and TF*IDF.
  • Distance Computation: Compute the cosine similarity between the document vectors. The cosine of the angle between identical vectors is 1 and between perpendicular (dissimilar) ones is 0, so the normalised dot product of two document vectors is a value between 0 and 1, which is the measure of similarity between them.

The test case used in this post is finding the similarity between two news reports [^1, ^2] of a recent bus accident (sources are listed in the References). The programming language Python and its Natural Language Toolkit library nltk [^3] are primarily used here. The similarity analysis proceeds in the steps described below.

Documents Pre-Processing

The news reports contain many things that are irrelevant to a text-analysis exercise such as finding similarity. So they are pre-processed: words are converted to lower case and ‘stopwords’, like ‘the’ and ‘should’, are removed.
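A minimal sketch of this step using nltk, assuming the raw report texts are held in variables named report1 and report2 (names chosen here for illustration):

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # First run only: nltk.download('punkt'); nltk.download('stopwords')

    stop_words = set(stopwords.words('english'))

    def preprocess(text):
        # Lower-case, tokenize, and keep only alphabetic non-stopwords.
        return [w for w in word_tokenize(text.lower())
                if w.isalpha() and w not in stop_words]

    words1 = preprocess(report1)  # report1/report2: raw report texts
    words2 = preprocess(report2)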


Vectorization

Characterize each text as a vector. Each text has some words in common with the other and some that are unique. To account for all possibilities, a word set is formed which consists of the words from both documents. There are various methods by which words can be vectorised, meaning converted to vectors (arrays of numbers). A few of the prominent ones are explained below.

Frequency Count Method

The simplest way to create the vectors is to count the number of times each word from the common word set occurs in each individual document.
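A sketch of this count, assuming the pre-processed token lists words1 and words2 from the previous step (the dictionary name text1_count_dict is kept to match the prose below):

    from nltk import FreqDist

    # Common word set drawn from both documents.
    word_set = set(words1) | set(words2)

    # Count every word of the common set in the first report;
    # FreqDist returns 0 for words that do not occur.
    freq1 = FreqDist(words1)
    text1_count_dict = {w: freq1[w] for w in word_set}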

FreqDist counts the number of occurrences of a word in the given text. So, in the snippet above, text1_count_dict holds word-count pairs for all the words from the common word_set. The following table shows a few words with their frequencies:

         westbound  whether  windows  workers  worse  would  years
text1        1         0        1        0       0      1      0
text2        1         1        0        1       1      1      1

These vectors represent their respective texts in a crude way, and similarity can be assessed from them; this simple overlap-count approach is sometimes called the ‘Containment Ratio’ method. TF-IDF is a much better measure for representing a document.

TF-IDF Method

TF is document specific. It is a way to score the importance of words (or "terms") in a document based on how frequently they appear: if a word appears frequently in a document, it is considered important and gets a high score. Although it is easy to compute, it is ambiguous (‘green’ the colour and ‘Green’ the person’s name are not differentiated).
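A minimal TF sketch in plain Python (the helper name tf is an illustration, not from the original code):

    def tf(word, words):
        # Term frequency: occurrences of `word` in the token list
        # `words`, normalised by the document length.
        return words.count(word) / len(words)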

IDF is for the whole collection. It is a way to score how widely a word occurs across multiple documents: if a word appears in many documents, it is not a unique identifier and thus gets a lower score.
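A matching IDF sketch (again an illustrative helper; the 1 + n term is a common smoothing choice to avoid division by zero for unseen words):

    import math

    def idf(word, docs):
        # `docs` is a list of token lists; words appearing in many
        # documents get a lower score.
        n_containing = sum(1 for doc in docs if word in doc)
        return math.log(len(docs) / (1 + n_containing))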

TF-IDF of a word = (TF of the word) * (IDF of the word)
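Putting the two together over the common word set gives one TF-IDF vector per report (a sketch building on the tf and idf helpers and the word_set, words1, words2 variables above):

    docs = [words1, words2]

    def tfidf_vector(words):
        # One TF-IDF score per word of the common set, in a fixed order.
        return [tf(w, words) * idf(w, docs) for w in sorted(word_set)]

    v1 = tfidf_vector(words1)
    v2 = tfidf_vector(words2)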


Word Embedding Method

Of late, word embeddings are being used to vectorise words and, through them, whole documents. Google’s Word2Vec and the Doc2Vec model available from Python’s gensim library [^6] can be used to vectorise the news reports and then find the similarity between them.
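A Doc2Vec sketch using the gensim 4.x API (two documents is far too small a corpus for a meaningful model, so this only shows the plumbing):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Tag each pre-processed report with an integer id.
    tagged = [TaggedDocument(words, [i])
              for i, words in enumerate([words1, words2])]

    model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
    print(model.dv.similarity(0, 1))  # pairwise document similarity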

Once the texts are vectorised, the similarity score between them is nothing but the ‘distance’ between their vectors.


Distance Computation

Following are the steps to compute the similarity of two texts using the TF-IDF method. The score is the cosine similarity of the two document vectors v1 and v2: their dot product divided by the product of their magnitudes.
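A sketch of the computation, using the TF-IDF vectors v1 and v2 built earlier:

    import math

    def cosine_similarity(v1, v2):
        # Dot product divided by the product of the vector magnitudes.
        dot = sum(a * b for a, b in zip(v1, v2))
        norm1 = math.sqrt(sum(a * a for a in v1))
        norm2 = math.sqrt(sum(b * b for b in v2))
        return dot / (norm1 * norm2)

    print('Similarity: {:.2f} %'.format(100 * cosine_similarity(v1, v2)))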


For the given two news items, the similarity score came to about 72.62%.

In the case of the Word Embedding method, the Doc2Vec model itself can compute the similarity of the given texts. For the given two news items, the similarity score came to about 79.06%.

Conclusion

TF-IDF and Doc2Vec are thus some of the quick measures for assessing the similarity of documents, but both are rather crude. Further refinement can be brought to this analysis using topic modelling, thematic summarization of the news items, etc.


References

  1. News Source 1: http://www.ndtv.com/world-news/at-least-13-killed-in-california-tour-bus-crash-report-1478120?pfrom=home-topstories
  2. News Source 2: http://www.foxnews.com/us/2016/10/23/3-dead-in-california-tour-bus-semi-truck-collision.html
  3. NLTK: nltk.org; tokenizing tutorial: https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/
  4. Computing Document Similarity with NLTK (March 2014): https://www.youtube.com/watch?v=FfLo5OHBwTo
  5. Tutorial: Finding Important Words in Text Using TF-IDF: http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/
  6. Gensim: Word2Vec, Doc2Vec: https://radimrehurek.com/gensim/models/doc2vec.html


