Cosine similarity measures how similar two data points are by treating them as vectors and comparing their directions. It is used as a metric in machine learning algorithms such as KNN to determine how close neighbors are, in recommendation systems to recommend items such as movies with similar characteristics, and with textual data to find how similar the texts in a document collection are. In this article, let us understand why cosine similarity is a popular metric in these applications.
Table of Contents
- About cosine similarity
- Why is cosine similarity a popular metric?
- Use of cosine similarity in machine learning
- Use of cosine similarity in recommendation systems
- Use of cosine similarity with textual data
About cosine similarity
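Cosine similarity is the cosine of the angle between two vectors, and it is used to evaluate how close two data points are when each is represented as a vector. Its value ranges from -1 to 1: vectors that point in the same direction score 1, perpendicular vectors score 0, and vectors that point in opposite directions score -1. As the angle between two vectors increases, their similarity decreases. Because only the angle matters, the lengths of the vectors do not affect the score.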
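As a minimal sketch of that definition, the score is the dot product of the two vectors divided by the product of their lengths. The function name `cosine_sim` and the sample vectors below are illustrative assumptions, not something taken from a library.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / norm_product)

# Same direction but different lengths: similarity close to 1.
same_direction = cosine_sim([1, 2, 3], [2, 4, 6])
# Perpendicular vectors: similarity close to 0.
orthogonal = cosine_sim([1, 0], [0, 1])
# Opposite directions: similarity close to -1.
opposite = cosine_sim([1, 1], [-1, -1])
```

Note that the first pair differs in length by a factor of two and still scores as fully similar, which is the property that makes the measure useful for documents of different sizes.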
Cosine similarity finds its major use with categorical and text data. In machine learning it can serve as the metric in the KNN algorithm to determine the nearest neighbors in classification tasks. In recommendation systems the same principle applies: content with lower similarity scores is ranked at the bottom of the recommendations, while content with higher similarity scores is ranked at the top. With textual data, cosine similarity is used to find the similarity between vectorized texts derived from the original documents.
Why is cosine similarity a popular metric?
There are various distance measures that are used as a metric for the evaluation of data points. Some of them are as follows.
- Euclidean distance
- Manhattan distance
- Minkowski distance
- Hamming distance and many more.
Among these popular distance metrics, Hamming distance could be used in place of cosine similarity for classification or text data, for example as a metric for KNN, recommendation systems, and textual data. However, Hamming distance compares sequences position by position, so it only works on data of the same length, whereas cosine similarity can handle documents of varying length once they are vectorized. With textual data, Hamming distance does not take into account how often words occur in a document, which tends to give a lower similarity for related texts. Cosine similarity does take word frequencies into account, which tends to give higher similarity scores for texts that share their frequent words.
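The contrast can be sketched with two small helpers. Both `hamming_distance` and `cosine_from_counts` are hypothetical functions written for this illustration, and the sample texts are made up. The sketch assumes whitespace tokenization and non-empty texts.

```python
from collections import Counter
import math

def hamming_distance(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(s, t))

def cosine_from_counts(text_a, text_b):
    """Cosine similarity of the word-count vectors of two non-empty texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # A Counter returns 0 for words it has not seen, so unshared words add nothing.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

short_doc = "good movie"
long_doc = "good movie good movie good movie"

# Equal-length strings work: three positions differ.
hamming_equal_length = hamming_distance("karolin", "kathrin")

# Texts of different length cannot be compared with Hamming distance.
try:
    hamming_distance(short_doc, long_doc)
    hamming_failed = False
except ValueError:
    hamming_failed = True

# Cosine similarity compares word proportions, so the length difference is ignored.
similarity = cosine_from_counts(short_doc, long_doc)
```

Here the longer text repeats the same words in the same proportions, so its count vector points in the same direction as the shorter one and the similarity comes out as 1 up to floating-point rounding.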
Use of cosine similarity in machine learning
In machine learning, cosine similarity can be used for classification tasks as the metric in the KNN classification algorithm, where it determines which data points count as the nearest neighbors. The fitted KNN model can be compared against other classification algorithms, and it can be evaluated on its own with performance measures such as the accuracy score and the AUC score. A classification report can also be obtained to evaluate other measures such as precision and recall.
Let us see how to use cosine similarity as a metric in machine learning
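The snippet below is a minimal sketch of one way to do this with scikit-learn. It is not the article's original code. The Iris dataset, the 70/30 split, the random seed, and `n_neighbors=5` are all assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed example data: the Iris dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# metric="cosine" makes KNN rank neighbors by cosine distance (1 - cosine similarity).
# Brute-force search is requested explicitly because the tree-based searches
# do not support the cosine metric.
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine", algorithm="brute")
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Evaluate the fitted model on the held-out split.
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)  # precision, recall and F1 per class
print(report)
```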
A KNN model that uses cosine similarity as its metric can be fitted on the training split and then used to obtain predictions, which feed into the evaluation measures described above.
So in machine learning, cosine similarity can be used as the metric that decides which points are the nearest neighbors: data points with higher similarity to a query point are treated as its nearest neighbors, and data points with lower similarity are not considered. The number of neighbors itself remains a parameter to tune. So this is how cosine similarity is used in machine learning.
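To make the neighbor selection concrete, here is a small sketch using scikit-learn's `NearestNeighbors` with the cosine metric. The four points and the query are made-up values. scikit-learn reports cosine distance, so the similarity is recovered as one minus the distance.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up points for illustration.
points = np.array([
    [1.0, 0.0],   # index 0: same direction as the query
    [2.0, 0.1],   # index 1: almost the same direction
    [0.0, 1.0],   # index 2: perpendicular to the query
    [-1.0, 0.0],  # index 3: opposite direction
])
query = np.array([[3.0, 0.0]])

# Find the two neighbors with the smallest cosine distance to the query.
nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute").fit(points)
distances, indices = nn.kneighbors(query)

# Cosine distance is 1 - cosine similarity, so convert back for readability.
similarities = 1.0 - distances
print(indices[0], similarities[0])
```

The query is three times longer than the point at index 0, yet that point is still its closest neighbor, because only direction counts.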
Use of cosine similarity in recommendation systems
Recommendation systems are one kind of machine learning application that works on the similarity of content. There are various ways to measure the similarity between two pieces of content, and recommendation systems typically use a similarity matrix to recommend similar content to users based on their viewing behavior.
Any recommendation dataset can be taken, and the features that are useful for recommending content can be extracted from it. Once the required textual data is available, it has to be vectorized with CountVectorizer to obtain a matrix of word counts. The cosine_similarity function of scikit-learn can then be applied to that count matrix to obtain the similarity matrix used for making recommendations.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # df is assumed to be a pandas DataFrame with a 'text_data' column
    count_vec = CountVectorizer()
    count_matrix = count_vec.fit_transform(df['text_data'])  # word counts per document
    print('Count Matrix', count_matrix.toarray())
    cos_sim = cosine_similarity(count_matrix)  # pairwise document similarities
The cosine similarity step yields a similarity matrix for the selected textual data, and the content can then be sorted by similarity score so that the highest-scoring items come first. Here cosine similarity takes into account the frequently occurring terms in the text: those terms get higher counts in the vectors, so content that shares them receives higher similarity scores and is ranked higher in the recommendations. So this is how cosine similarity is used in recommendation systems.
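The sorting step can be sketched as follows. The four titles, their descriptions, and the `recommend` helper are hypothetical and exist only to make the example runnable. A real system would use its own catalogue in place of this toy DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue, invented for illustration.
df = pd.DataFrame({
    "title": ["Space Wars", "Galaxy Battle", "Love in Paris", "Paris Romance"],
    "text_data": [
        "space battle alien ship",
        "galaxy battle alien ship war",
        "romance love paris",
        "love story paris romance city",
    ],
})

count_matrix = CountVectorizer().fit_transform(df["text_data"])
cos_sim = cosine_similarity(count_matrix)  # cos_sim[i, j] compares item i with item j

def recommend(title, top_n=2):
    """Return the top_n titles most similar to the given title (hypothetical helper)."""
    idx = df.index[df["title"] == title][0]
    # Sort item positions from most to least similar, then drop the item itself.
    order = np.argsort(cos_sim[idx])[::-1]
    ranked = [i for i in order if i != idx][:top_n]
    return df["title"].iloc[ranked].tolist()

space_recommendations = recommend("Space Wars", top_n=1)
romance_recommendations = recommend("Love in Paris", top_n=1)
print(space_recommendations, romance_recommendations)
```

Each item is matched with the title whose description shares the most words with its own, which is the ranking behaviour described above.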
Use of cosine similarity with textual data
With textual data, cosine similarity is used to compare two text documents or tokenized texts. To use it, the raw text first has to be tokenized, and from the tokenized text a count matrix has to be generated, which can then be passed to the cosine similarity function to evaluate the similarity between the documents.
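    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # tokenized_data is assumed to be a list of strings, one per document
    first_two = tokenized_data[:2]  # the first two documents ([1:3] would skip the first one)
    count_vectorizer = CountVectorizer()
    count_matrix = count_vectorizer.fit_transform(first_two)  # word counts per document
    cos_sim_matrix = cosine_similarity(count_matrix)  # pairwise similarities, shape (2, 2)
    # create_dataframe is a helper that is not defined in this article
    create_dataframe(cos_sim_matrix, first_two)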
The above code measures the similarity between tokenized documents. Here the first two tokenized documents from the corpus are used, and the output is a similarity matrix for that pair.
Now let us interpret the sample output produced by cosine similarity. Cosine similarity takes into account the words that occur in both documents, and here it has yielded a similarity of 50% between the first and the second document in the corpus. So this is how cosine similarity is used with textual data.
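To see how a score of 50% can arise, consider the small worked sketch below. The two documents are made up for illustration and are not the article's corpus. They share one of their two words, which gives count vectors [1, 0, 1] and [1, 1, 0] over the vocabulary (data, engineering, science), and therefore a cosine of 1 / (sqrt(2) * sqrt(2)) = 0.5.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up documents that share exactly one word.
docs = ["data science", "data engineering"]

vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(docs)

# Vocabulary terms ordered by their column position in the count matrix.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

cos_sim_matrix = cosine_similarity(count_matrix)
score = cos_sim_matrix[0, 1]  # similarity between the first and second document
print(vocabulary, count_matrix.toarray(), score)
```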
Among the various metrics, cosine similarity is widely used in machine learning tasks and in handling textual data because it adapts well to different characteristics of data, such as documents of varying length. Cosine similarity operates entirely on the angle between vectors. It is widely used in recommendation systems, where it helps recommend content to users according to their most viewed content and its characteristics, and in finding the similarity between text documents, where it takes frequently occurring terms into account. This is what makes cosine similarity a popular metric in various applications.