TF-IDF is a method which gives us a numerical weightage of words which reflects how important the particular word is to a document in a corpus. A corpus is a collection of documents. Tf is Term frequency, and IDF is Inverse document frequency. This method is often used for information retrieval and text mining.
Tf(Term Frequency): Term frequency can be thought of as how often does a word ‘w’ occur in a document ‘d’. More importance is given to words frequently occurring in a document. The formula of Term frequency is:
IDF(inverse document frequency): Sometimes, words like ‘the’ occur a lot and do not give us vital information regarding the document. To minimize the weight of terms occurring very frequently by incorporating the weight of words rarely occurring in the document. In other words, idf is used to calculate rare words’ weight across all documents in corpus. Words rarely occurring in the corpus will have higher IDF values
Combining these two we come up with the TF-IDF score.
Why Implement Tf-IDF from scratch?
We do know that sklearn has a direct and one line code of the implementation of sklearn.feature_extraction.text.TfidfVectorizer. Then why is there a need for implementing this from scratch?
For some cases, it is done to understand what TFIDF does internally and have a better understanding of it. Knowing the function is one thing and knowing when to use is another.
Another reason might be, in the real world, we tend to play with GBs or TBs of data. So here scikit learn implementation might not be useful or might not give good results. So in such scenarios, we tend to write TFIDFVectorizer from scratch that could handle such huge data.
Using python to implement Tf-IDF
First and foremost is to import all the libraries needed for this.
from collections import Counter from tqdm import tqdm from scipy.sparse import csr_matrix import math import operator from sklearn.preprocessing import normalize import numpy as np
Basic libraries imported.
‘from sklearn.preprocessing import normalize’:- As the documentation says, normalization here means making our data have a unit length, so specifying which length (i.e. which norm) is also required. Here Sklearn applies L2-normalization on its output matrix, i.e. Euclidean length.
corpus = [ 'this is the first document', 'this document is the second document', 'and this is the third one', 'is this the first document', ]
For simplicity, we are taking four reviews or documents as our data corpus and storing them in a list.
def IDF(corpus, unique_words): idf_dict={} N=len(corpus) for i in unique_words: count=0 for sen in corpus: if i in sen.split(): count=count+1 idf_dict[i]=(math.log((1+N)/(count+1)))+1 return idf_dict
We will be defining a function IDF whose parameter will be the corpus and the unique words.
The reason why we are adding ‘1’ to numerator and denominator and also to the whole equation of ‘idf_dict[i]’ is to maintain numerical stability. There might be situations where there are no values, which will generate an error(avoiding division of zeros). So to avoid that error, we are creating numerical stability.
This code snippet will generate the idf values of all the unique words when ‘fit’ function is called.
def fit(whole_data): unique_words = set() if isinstance(whole_data, (list,)): for x in whole_data: for y in x.split(): if len(y)<2: continue unique_words.add(y) unique_words = sorted(list(unique_words)) vocab = {j:i for i,j in enumerate(unique_words)} Idf_values_of_all_unique_words=IDF(whole_data,unique_words) return vocab, Idf_values_of_all_unique_words Vocabulary, idf_of_vocabulary=fit(corpus)
Here we initialised ‘unique_words’ as a set to get all the unique values.(Set has a property where it does not print out duplicate values).
Checking if the ‘whole_data’ is a list or not. In our case, Corpus is a list. We are splitting the list and iterating over the list to find unique words and appending them in the set.
All words having a length of less than two are discarded.
We are calling IDF function inside the fit function which will give us the idf values of all the unique words generated and will store them in ‘idf_values_of_all_unique_words’
The fit function will return the words and their idf values respectively.
We will assign the values to ‘Vocabulary’ and ‘idf_of_vocabulary’
print(list(Vocabulary.keys())) ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third','this']
print(list(idf_of_vocabulary.values())) [1.916290731874155, 1.2231435513142097, 1.5108256237659907, 1.0, 1.916290731874155, 1.916290731874155, 1.0, 1.916290731874155, 1.0]
This is the output we will get when we perform the fit function.
We are coding the fit and transform the function of TFIDFVectorizer.
Now jumping towards the transform function.
def transform(dataset,vocabulary,idf_values): sparse_matrix= csr_matrix( (len(dataset), len(vocabulary)), dtype=np.float64) for row in range(0,len(dataset)): number_of_words_in_sentence=Counter(dataset[row].split()) for word in dataset[row].split(): if word in list(vocabulary.keys()): tf_idf_value=(number_of_words_in_sentence[word]/len(dataset[row].split()))*(idf_values[word]) sparse_matrix[row,vocabulary[word]]=tf_idf_value print("NORM FORM\n",normalize(sparse_matrix, norm='l2', axis=1, copy=True, return_norm=False)) output =normalize(sparse_matrix, norm='l2', axis=1, copy=True, return_norm=False) return output final_output=transform(corpus,Vocabulary,idf_of_vocabulary) print(final_output.shape)
Here we are using the transform function to get a sparse matrix representation output of the corpus.
We used the TF-IDF formula to calculate the values of all the unique words in the set.
As we talked earlier about the l2 norm, here sklearn implements l2 so with the help of ‘normalize’ we initialize l2 norm to get perfect output.
We want the sparse matrix representation so initialised ‘sparse_matrix’ in ‘normalize’
Sparse matrix is a type of matrix with very few non zero values and more zero values. We use sparse matrix only when the matrix has several zero values.
In the final output, we called the initial corpus and the output of the fit function.
Here the output we will get will be in sparse representation.
This is the sparse representation of the matrix.
We can convert the sparse matrix representation to a dense representation or dense matrix.
print(final_output[0].toarray()) [[0. 0.46979139 0.58028582 0.38408524 0. 0. 0.38408524 0. 0.38408524]]
This is how one would implement TF-IDF in python from scratch.
To compare, we can use the sklearn library and check if the values match or not.
Conclusion
Tf-IDF is one of the most used methods to transform text into numeric form. Here we implemented Tf-IDF from scratch in python, which is very useful when we have tons of data and when sklearn might not give good results.