How To Perform Set Operations On Pandas DataFrames

Rohit Dwivedi

In data science, we often extract and scrape data from multiple sources. While analyzing this data, we frequently need to compare data frames, for example to check which rows differ between them or which rows they have in common. Set operations such as union, intersection, and difference serve this purpose. In this article, we will look at what these set operations are and how they are used for comparison. We will first create three data frames and then perform each operation on them.

What will we learn from this article?

  1. What are Set Operations in Pandas Dataframe?
  2. What is Union Operation? How to perform this?
  3. What is Intersection Operation? How to perform this?
  4. What is the Difference Operation? How to perform this operation?

  1. What are Set Operations?

Set operations are mathematical operations used to compare collections of items. Suppose we have three data frames with two columns each, holding the names and IDs of students enrolled in three different courses: Machine Learning (ML), NLP, and Computer Vision (CV). We may want to list all the students, or the students who are in ML but not in NLP, or other combinations like these. Refer to the tables below for all three courses.



[Tables: students enrolled in Machine Learning, NLP, and CV]
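
Before moving to pandas, the three operations can be illustrated with Python's built-in set type. The sketch below is only an illustration: the variable names are made up for this example, and the IDs are the same student IDs used in this article's data frames.

```python
# Student IDs per course (same values as in the article's data frames).
ml_ids = {"101", "102", "103", "104"}
nlp_ids = {"101", "105", "106", "104"}
cv_ids = {"101", "102", "107", "106"}

# Union: every ID that appears in at least one course.
all_ids = ml_ids | nlp_ids | cv_ids

# Intersection: IDs that appear in both ML and NLP.
ml_and_nlp = ml_ids & nlp_ids

# Difference: IDs that appear in ML but not in NLP.
ml_not_nlp = ml_ids - nlp_ids

print(sorted(all_ids))
print(sorted(ml_and_nlp))
print(sorted(ml_not_nlp))
```

Plain sets only hold the IDs; the pandas versions in the rest of the article keep the full rows, including the student names.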

Now we will create these three tables using pandas. First, we import the pandas package, and then we create the tables with the code below.

import pandas as pd

# One data frame per course; Student_ID values are stored as strings.
ML_df = pd.DataFrame({"Name": ["Rohit", "Arpit", "Chiranjeev", "Piyush"],
                      "Student_ID": ["101", "102", "103", "104"]})
NLP_df = pd.DataFrame({"Name": ["Rohit", "Aman", "Ayush", "Piyush"],
                       "Student_ID": ["101", "105", "106", "104"]})
CV_df = pd.DataFrame({"Name": ["Rohit", "Arpit", "Pawan", "Ayush"],
                      "Student_ID": ["101", "102", "107", "106"]})

  2. What is Union Operation? How to perform this?

The union operation collects every row that appears in any of the tables, keeping each distinct row only once. Suppose we need to find all the students enrolled in at least one of the three courses, along with their IDs; for that we use the union operation.

All Students = ML ∪ NLP ∪ CV 

Use the code below to compute the union of all three data frames.

all_students = pd.concat([ML_df,NLP_df,CV_df], ignore_index = True)

all_students = all_students.drop_duplicates()

print(all_students)

Output: the seven distinct students, with Student_IDs 101 to 107.

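After drop_duplicates, the result keeps the index labels of the concatenated frame, so the labels have gaps where duplicate rows were removed. As a minimal sketch, the hypothetical helper `union_frames` below (not part of pandas) wraps the same concat and drop_duplicates steps and then renumbers the rows with reset_index.

```python
import pandas as pd

ML_df = pd.DataFrame({"Name": ["Rohit", "Arpit", "Chiranjeev", "Piyush"],
                      "Student_ID": ["101", "102", "103", "104"]})
NLP_df = pd.DataFrame({"Name": ["Rohit", "Aman", "Ayush", "Piyush"],
                       "Student_ID": ["101", "105", "106", "104"]})
CV_df = pd.DataFrame({"Name": ["Rohit", "Arpit", "Pawan", "Ayush"],
                      "Student_ID": ["101", "102", "107", "106"]})


def union_frames(*frames):
    """Hypothetical helper: stack the frames, then drop repeated rows."""
    combined = pd.concat(frames, ignore_index=True)
    # reset_index(drop=True) renumbers the surviving rows from 0.
    return combined.drop_duplicates().reset_index(drop=True)


all_students = union_frames(ML_df, NLP_df, CV_df)
print(all_students)
```

Note that drop_duplicates compares whole rows, so two rows count as the same student only when both the Name and the Student_ID match.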

  3. What is Intersection Operation? How to perform this?

Intersection is the counterpart of union: we keep only the rows that are common to both data frames. Suppose we have to pick the students who are enrolled in both the ML and NLP courses, or the students who are in both ML and CV. Because merge performs an inner join on all shared columns by default, the code below computes the intersection of two data frames.

Common_ML_NLP = ML ∩ NLP 

Common_ML_NLP = ML_df.merge(NLP_df)

print(Common_ML_NLP)

Output: Rohit (101) and Piyush (104), the two students enrolled in both ML and NLP.


Common_ML_CV = ML ∩ CV

Common_ML_CV = ML_df.merge(CV_df)

print(Common_ML_CV)

Output: Rohit (101) and Arpit (102), the two students enrolled in both ML and CV.
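
The same idea extends to more than two data frames. The sketch below is one possible approach, not the only one: it chains inner merges with functools.reduce to find students enrolled in all three courses, and it spells out the `how` and `on` arguments that the calls above leave at their defaults. The names `keys` and `common_all` are chosen for this example.

```python
from functools import reduce

import pandas as pd

ML_df = pd.DataFrame({"Name": ["Rohit", "Arpit", "Chiranjeev", "Piyush"],
                      "Student_ID": ["101", "102", "103", "104"]})
NLP_df = pd.DataFrame({"Name": ["Rohit", "Aman", "Ayush", "Piyush"],
                       "Student_ID": ["101", "105", "106", "104"]})
CV_df = pd.DataFrame({"Name": ["Rohit", "Arpit", "Pawan", "Ayush"],
                      "Student_ID": ["101", "102", "107", "106"]})

# Columns that must match for two rows to count as the same student.
keys = ["Name", "Student_ID"]

# Inner-merge the frames one after another: ML with NLP, then that result with CV.
common_all = reduce(
    lambda left, right: left.merge(right, how="inner", on=keys),
    [ML_df, NLP_df, CV_df],
)
print(common_all)
```

With this data, only Rohit (101) appears in all three courses.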

  4. What is the Difference Operation? How to perform this operation?

The difference operation picks the rows of one data frame that are not present in the other. Suppose we need to find the students who are enrolled in ML but not in NLP, or in ML but not in CV. Refer to the code below to compute this.

ML_NLP = ML_df[~ML_df.Student_ID.isin(NLP_df.Student_ID)]

print(ML_NLP) 

Output: Arpit (102) and Chiranjeev (103), who are enrolled in ML but not in NLP.

ML_CV = ML_df[~ML_df.Student_ID.isin(CV_df.Student_ID)]

print(ML_CV) 

Output: Chiranjeev (103) and Piyush (104), who are enrolled in ML but not in CV.
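
The isin approach compares only the Student_ID column and gives a one-sided difference. As an alternative sketch, an outer merge with `indicator=True` labels each row with the frame it came from, which makes both one-sided differences and the symmetric difference (students in exactly one of the two courses) available from a single merge. The variable names below are chosen for this example.

```python
import pandas as pd

ML_df = pd.DataFrame({"Name": ["Rohit", "Arpit", "Chiranjeev", "Piyush"],
                      "Student_ID": ["101", "102", "103", "104"]})
NLP_df = pd.DataFrame({"Name": ["Rohit", "Aman", "Ayush", "Piyush"],
                       "Student_ID": ["101", "105", "106", "104"]})

keys = ["Name", "Student_ID"]

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only", or "both" for each row.
merged = ML_df.merge(NLP_df, how="outer", on=keys, indicator=True)

# Rows found only in ML_df (ML minus NLP).
only_ml = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Rows found only in NLP_df (NLP minus ML).
only_nlp = merged[merged["_merge"] == "right_only"].drop(columns="_merge")

# Symmetric difference: rows found in exactly one of the two frames.
either_not_both = merged[merged["_merge"] != "both"].drop(columns="_merge")

print(only_ml)
print(only_nlp)
print(either_not_both)
```

Unlike the isin version, this compares whole rows on both Name and Student_ID, so it treats a student as common only when both columns match.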

Conclusion 

In this article, we discussed the basic set operations in pandas that are performed between data frames to find their combined, common, and differing rows. We first looked at the union operation, followed by the intersection and difference operations. These operations are very useful for manipulating data frames and understanding the data.

Copyright Analytics India Magazine Pvt Ltd
