How To Perform Set Operations On Pandas DataFrames

In data science, we often extract and scrape data from multiple sources. While analyzing this data, we run into situations where we need to compare data frames, for example to check which rows differ between them or which rows they have in common. Set operations such as union, intersection, and difference let us do this. Through this article, we will understand what these set operations are and how they are used for comparison. We will first create three data frames and then perform these set operations on them.

What will we learn from this article?

  1. What are Set Operations in Pandas Dataframe?
  2. What is Union Operation? How to perform this?
  3. What is Intersection Operation? How to perform this?
  4. What is the Difference Operation? How to perform this operation?
  1. What are Set Operations? 

Set operations are mathematical operations used to combine and compare collections. Suppose we have three data frames, one per course (Machine Learning, NLP, and Computer Vision), each with two columns holding the names and IDs of the enrolled students. We might want to list all the students, or the students who are in ML but not in NLP, and other combinations like these. The tables below show the students in each of the three courses.

[Tables: students enrolled in Machine Learning, NLP, and CV]
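Before moving to data frames, the three operations can be illustrated with Python's built-in `set` type. This is only a sketch: the variable names are made up for illustration, and the IDs are the same values used in the data frames in this article.

```python
# Student IDs for each course (same values as in the article's data frames).
ml_ids = {"101", "102", "103", "104"}
nlp_ids = {"101", "105", "106", "104"}
cv_ids = {"101", "102", "107", "106"}

# Union: students enrolled in at least one course.
all_ids = ml_ids | nlp_ids | cv_ids

# Intersection: students enrolled in both ML and NLP.
ml_and_nlp = ml_ids & nlp_ids

# Difference: students enrolled in ML but not in NLP.
ml_not_nlp = ml_ids - nlp_ids

print(sorted(all_ids))
print(sorted(ml_and_nlp))
print(sorted(ml_not_nlp))
```

Plain sets only track the IDs. The pandas versions in the rest of the article keep the whole row, including the student name.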

Now we will create these three tables using pandas. First we import the pandas package, and then we create one data frame per course.

import pandas as pd

# One data frame per course; Student_ID values are stored as strings.
ML_df = pd.DataFrame({"Name": ["Rohit", "Arpit", "Chiranjeev", "Piyush"],
                      "Student_ID": ["101", "102", "103", "104"]})
NLP_df = pd.DataFrame({"Name": ["Rohit", "Aman", "Ayush", "Piyush"],
                       "Student_ID": ["101", "105", "106", "104"]})
CV_df = pd.DataFrame({"Name": ["Rohit", "Arpit", "Pawan", "Ayush"],
                      "Student_ID": ["101", "102", "107", "106"]})
  2. What is Union Operation? How to perform this?

The union operation collects every row that appears in any of the tables, keeping each distinct row once. Suppose we need the list of all students enrolled in at least one of the three courses, along with their IDs. This is a union.

All Students = ML ∪ NLP ∪ CV 

Use the code below to compute the union of all three data frames.

# Stack the three data frames, then drop the rows that repeat.
all_students = pd.concat([ML_df, NLP_df, CV_df], ignore_index=True)
all_students = all_students.drop_duplicates()
print(all_students)

Output: seven distinct students, with Student_ID values 101 to 107. The row labels from the concatenated frame are kept, so they run 0, 1, 2, 3, 5, 6, 10.
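As a hedged variation on the union, the sketch below adds `reset_index(drop=True)` so the row labels run from 0 without gaps, and compares the result with an alternative built from chained outer merges. The names `union_concat` and `union_merge` are illustrative, and the data frames are redefined so the block runs on its own.

```python
import pandas as pd

ML_df = pd.DataFrame({"Name": ["Rohit", "Arpit", "Chiranjeev", "Piyush"],
                      "Student_ID": ["101", "102", "103", "104"]})
NLP_df = pd.DataFrame({"Name": ["Rohit", "Aman", "Ayush", "Piyush"],
                       "Student_ID": ["101", "105", "106", "104"]})
CV_df = pd.DataFrame({"Name": ["Rohit", "Arpit", "Pawan", "Ayush"],
                      "Student_ID": ["101", "102", "107", "106"]})

# concat + drop_duplicates, then renumber the rows 0..n-1.
union_concat = (
    pd.concat([ML_df, NLP_df, CV_df], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)

# Alternative: outer merges on the shared columns (Name and Student_ID).
union_merge = ML_df.merge(NLP_df, how="outer").merge(CV_df, how="outer")

print(union_concat)
print(sorted(union_merge["Student_ID"]))
```

Both approaches return the same set of students here. The row order of the merged result can differ from the concatenated one, so compare contents rather than positions.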
  3. What is Intersection Operation? How to perform this?

Where the union keeps every row, the intersection keeps only the rows that two data frames have in common. Suppose we have to pick the students who are enrolled in both the ML and NLP courses, or the students who are in both ML and CV. Refer to the code below to see how to compute the intersection of two data frames.

Common_ML_NLP = ML ∩ NLP 

Common_ML_NLP = ML_df.merge(NLP_df)

print(Common_ML_NLP)

Output: two rows, Rohit (101) and Piyush (104).

Common_ML_CV = ML ∩ CV

Common_ML_CV = ML_df.merge(CV_df)

print(Common_ML_CV)

Output: two rows, Rohit (101) and Arpit (102).
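Calling `merge` with no arguments performs an inner join on every column the two data frames share. A minimal sketch that spells those defaults out, and chains a second merge to find the students enrolled in all three courses, might look like this (the `key_cols`, `common_ml_nlp`, and `common_all` names are illustrative):

```python
import pandas as pd

ML_df = pd.DataFrame({"Name": ["Rohit", "Arpit", "Chiranjeev", "Piyush"],
                      "Student_ID": ["101", "102", "103", "104"]})
NLP_df = pd.DataFrame({"Name": ["Rohit", "Aman", "Ayush", "Piyush"],
                       "Student_ID": ["101", "105", "106", "104"]})
CV_df = pd.DataFrame({"Name": ["Rohit", "Arpit", "Pawan", "Ayush"],
                      "Student_ID": ["101", "102", "107", "106"]})

key_cols = ["Name", "Student_ID"]

# Explicit inner join: rows present in both ML_df and NLP_df.
common_ml_nlp = ML_df.merge(NLP_df, how="inner", on=key_cols)

# Chain another inner join to intersect with CV_df as well.
common_all = common_ml_nlp.merge(CV_df, how="inner", on=key_cols)

print(common_ml_nlp)
print(common_all)
```

Naming the join columns with `on=` guards against surprises if the data frames later gain extra columns that happen to share a name.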

  4. What is the Difference Operation? How to perform this operation?

The difference operation keeps the rows of one data frame that do not appear in another. Suppose we need to find the students who are enrolled in ML but not in NLP. To do this, we keep the ML rows whose Student_ID does not occur in the NLP data frame. The same approach gives the students who are in ML but not in CV. Refer to the code below.

# Keep the ML rows whose Student_ID does not appear in the NLP data frame.
ML_NLP = ML_df[~ML_df["Student_ID"].isin(NLP_df["Student_ID"])]

print(ML_NLP) 

Output: two rows, Arpit (102) and Chiranjeev (103).

# Keep the ML rows whose Student_ID does not appear in the CV data frame.
ML_CV = ML_df[~ML_df["Student_ID"].isin(CV_df["Student_ID"])]

print(ML_CV) 

Output: two rows, Chiranjeev (103) and Piyush (104).
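The `isin` filter above is one-directional: it returns the ML students who are missing from the other course. If the goal is the students who are in exactly one of two courses (the symmetric difference), one possible sketch uses `merge` with `indicator=True`, which adds a `_merge` column recording where each row came from. The names `ml_only` and `either_only` are illustrative.

```python
import pandas as pd

ML_df = pd.DataFrame({"Name": ["Rohit", "Arpit", "Chiranjeev", "Piyush"],
                      "Student_ID": ["101", "102", "103", "104"]})
NLP_df = pd.DataFrame({"Name": ["Rohit", "Aman", "Ayush", "Piyush"],
                       "Student_ID": ["101", "105", "106", "104"]})

# Difference (ML minus NLP): left join, keep rows found only on the left.
left = ML_df.merge(NLP_df, how="left", indicator=True)
ml_only = left[left["_merge"] == "left_only"].drop(columns="_merge")

# Symmetric difference: outer join, keep rows not present in both.
outer = ML_df.merge(NLP_df, how="outer", indicator=True)
either_only = outer[outer["_merge"] != "both"]

print(ml_only)
print(either_only)
```

In `either_only`, the `_merge` column is left in place so each row shows whether the student came from ML (`left_only`) or NLP (`right_only`).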

Conclusion 

In this article, we discussed the basic set operations that pandas can perform across data frames to find everything they contain, what they have in common, and where they differ. We first looked at the union operation, followed by the intersection and difference operations. These operations are useful for comparing data frames and understanding the data they hold.

Rohit Dwivedi
I am currently enrolled in a Post Graduate Program in Artificial Intelligence and Machine Learning. I am a data science enthusiast who likes to draw insights from data, and I am always amazed by the intelligence of AI. It is fascinating to teach a machine to see and understand images, and even more so when the machine can tell you what it just saw. This is why I am highly interested in Computer Vision and Natural Language Processing. I love exploring the different use cases that can be built with the power of AI. I am the person who first develops something and then explains it to the whole community through my writing.
