
Hands-on implementation of TF-IDF from scratch in Python

TF-IDF is a method that assigns each word a numerical weight reflecting how important that word is to a document in a corpus. A corpus is a collection of documents. TF stands for term frequency and IDF for inverse document frequency. The method is often used for information retrieval and text mining.

TF (term frequency): Term frequency captures how often a word 'w' occurs in a document 'd'. Words occurring frequently in a document are given more importance. The formula for term frequency is:

tf(w, d) = (number of times w occurs in d) / (total number of words in d)

IDF (inverse document frequency): Some words, such as 'the', occur very often yet carry little information about a document. IDF counteracts this by down-weighting terms that appear in many documents and up-weighting terms that appear in few of them; in other words, IDF measures how rare a word is across all documents in the corpus. Words rarely occurring in the corpus get higher IDF values. With the smoothing used in the code below, the formula is:

idf(w) = ln((1 + N) / (1 + df(w))) + 1

where N is the number of documents in the corpus and df(w) is the number of documents containing w.

Multiplying the two gives the TF-IDF score: tf-idf(w, d) = tf(w, d) × idf(w).
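For example, in the four-document corpus used below, the word 'document' occurs once in the five-word sentence 'this is the first document', so tf = 1/5 = 0.2; it appears in 3 of the 4 documents, so with the smoothing above idf = ln(5/4) + 1 ≈ 1.2231, giving a raw TF-IDF of about 0.2446 before normalization. These are exactly the numbers the code below produces.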

Why implement TF-IDF from scratch?

We know that sklearn ships a ready-made, one-line implementation: sklearn.feature_extraction.text.TfidfVectorizer. So why implement it from scratch?

One reason is to understand what TF-IDF does internally and build a better intuition for it. Knowing what a function computes is one thing; knowing when to use it is another.

Another reason is that, in the real world, we often work with gigabytes or terabytes of data. At that scale the stock scikit-learn implementation might not be practical or might not give good results, so we write a TfidfVectorizer from scratch that can handle such huge data.

Using Python to implement TF-IDF

First and foremost, we import all the libraries needed for this.

 from collections import Counter               # per-document word counts
 from scipy.sparse import csr_matrix           # sparse output matrix
 import math                                   # log for the IDF formula
 from sklearn.preprocessing import normalize   # L2 normalization
 import numpy as np 

Basic libraries imported. 

‘from sklearn.preprocessing import normalize’: As the documentation says, normalization here means scaling each sample vector to unit length, so we must also specify which length (i.e. which norm). Sklearn applies L2 normalization (Euclidean length) to its output matrix, and we will do the same.
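To see what this means, here is a minimal sketch (on a toy vector, not part of this article's data) of how normalize scales a row to unit L2 length:

 import numpy as np
 from sklearn.preprocessing import normalize

 x = np.array([[3.0, 4.0]])
 print(normalize(x, norm='l2'))  # [[0.6 0.8]], since the vector's length is 5 and 3/5 = 0.6, 4/5 = 0.8 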

 corpus = [
      'this is the first document',
      'this document is the second document',
      'and this is the third one',
      'is this the first document',
 ] 

For simplicity, we are taking four reviews or documents as our data corpus and storing them in a list.

 def IDF(corpus, unique_words):
   idf_dict = {}
   N = len(corpus)
   for word in unique_words:
     # count how many documents contain this word
     count = 0
     for sen in corpus:
       if word in sen.split():
         count = count + 1
     # smoothed IDF, as in sklearn: ln((1 + N) / (1 + df)) + 1
     idf_dict[word] = (math.log((1 + N) / (count + 1))) + 1
   return idf_dict 

We define a function IDF whose parameters are the corpus and the list of unique words.

The reason we add 1 to the numerator and the denominator, and again to the whole expression for idf_dict[i], is numerical stability. The 1 in the denominator avoids division by zero for a word that appears in no document, and the final +1 keeps words that occur in every document from being ignored entirely. This is the same smoothing that sklearn applies.
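As a quick check of the formula on our four-document corpus (N = 4), these are two of the values the fit function below will produce:

 import math
 # 'and' occurs in only 1 of the 4 documents:
 print(math.log((1 + 4) / (1 + 1)) + 1)   # 1.916290731874155
 # 'this' occurs in all 4 documents:
 print(math.log((1 + 4) / (1 + 4)) + 1)   # 1.0 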

This code snippet generates the IDF values of all the unique words when the fit function, defined next, is called.

 def fit(whole_data):
    unique_words = set()
    if isinstance(whole_data, (list,)):
      for x in whole_data:
        for y in x.split():
          # discard words shorter than two characters
          if len(y) < 2:
            continue
          unique_words.add(y)
      unique_words = sorted(list(unique_words))
      # map each word to a column index
      vocab = {j: i for i, j in enumerate(unique_words)}
      Idf_values_of_all_unique_words = IDF(whole_data, unique_words)
    else:
      raise TypeError("fit expects a list of documents")
    return vocab, Idf_values_of_all_unique_words
 Vocabulary, idf_of_vocabulary = fit(corpus) 

Here we initialise unique_words as a set to collect every distinct word (a set stores each element only once, so duplicates are discarded automatically).

We check whether whole_data is a list, which our corpus is. We then iterate over the documents, split each one into words, and add the words to the set.

All words shorter than two characters are discarded.

We call the IDF function inside fit, which computes the IDF values of all the unique words and stores them in Idf_values_of_all_unique_words.

The fit function returns the vocabulary and the IDF values respectively, which we assign to Vocabulary and idf_of_vocabulary.

 print(list(Vocabulary.keys())) 
 ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this'] 
 print(list(idf_of_vocabulary.values()))
 [1.916290731874155, 1.2231435513142097, 1.5108256237659907, 1.0, 1.916290731874155, 1.916290731874155, 1.0, 1.916290731874155, 1.0] 

This is the output we get when we run fit.

We are coding both the fit and transform functions of TfidfVectorizer. Now on to the transform function.

 def transform(dataset, vocabulary, idf_values):
     # rows are documents, columns are vocabulary words
     sparse_matrix = csr_matrix((len(dataset), len(vocabulary)), dtype=np.float64)
     for row in range(0, len(dataset)):
       number_of_words_in_sentence = Counter(dataset[row].split())
       for word in dataset[row].split():
           if word in vocabulary:
               # tf * idf for this word in this document
               tf_idf_value = (number_of_words_in_sentence[word] / len(dataset[row].split())) * idf_values[word]
               sparse_matrix[row, vocabulary[word]] = tf_idf_value
     # L2-normalize each row, as sklearn does
     output = normalize(sparse_matrix, norm='l2', axis=1, copy=True, return_norm=False)
     print("NORM FORM\n", output)
     return output
 final_output = transform(corpus, Vocabulary, idf_of_vocabulary)
 print(final_output.shape) 

Here we use the transform function to get a sparse matrix representation of the corpus.

We apply the TF-IDF formula to every word of every document, using the IDF values computed for the unique words.

As discussed earlier, sklearn applies the L2 norm to its output, so we pass our matrix through normalize with norm='l2' to match sklearn's results.

We pass sparse_matrix to normalize so that the output stays in sparse representation.

A sparse matrix is a matrix in which most of the values are zero. We use a sparse representation only when the matrix has many zero values, since it saves memory by storing just the non-zero entries.
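As a small illustration (a toy matrix, not our TF-IDF output), scipy's csr_matrix keeps only the non-zero entries together with their row and column positions:

 import numpy as np
 from scipy.sparse import csr_matrix

 m = csr_matrix(np.array([[0, 0, 3], [4, 0, 0]]))
 print(m)            # lists only the non-zero entries: (0, 2) 3 and (1, 0) 4
 print(m.toarray())  # converts back to a dense numpy array 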

To get the final output, we call transform on the original corpus together with the vocabulary and IDF values returned by fit. The result is the sparse representation of the TF-IDF matrix.

We can convert the sparse representation to a dense matrix:

 print(final_output[0].toarray())
 [[0.         0.46979139 0.58028582 0.38408524 0.         0.
   0.38408524 0.         0.38408524]] 

This is how one would implement TF-IDF in Python from scratch.

To verify, we can run sklearn's TfidfVectorizer on the same corpus and check whether the values match.
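Here is a minimal comparison sketch, assuming a recent scikit-learn (older versions use get_feature_names instead of get_feature_names_out). With its default settings, TfidfVectorizer applies the same smoothed IDF and L2 normalization, and its default token pattern also drops single-character words:

 from sklearn.feature_extraction.text import TfidfVectorizer

 vec = TfidfVectorizer()
 skl_output = vec.fit_transform(corpus)
 print(vec.get_feature_names_out())   # should match list(Vocabulary.keys())
 print(vec.idf_)                      # should match list(idf_of_vocabulary.values())
 print(skl_output[0].toarray())       # should match final_output[0].toarray() 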

Conclusion

TF-IDF is one of the most used methods to transform text into numeric form. Here we implemented TF-IDF from scratch in Python, which is useful both for understanding what it does internally and for situations with huge amounts of data where sklearn's stock implementation might not give good results.
