How to do R-like data manipulations using Pandas?

R and Python both play a crucial role in handling and manipulating data. Many beginners find it difficult to shift from Python to R, or vice versa, for such tasks, yet the two approaches have a lot in common: many data manipulation tasks done in R can also be done using Pandas in Python. In this article, we compare data manipulation in R and Pandas based on some of the important functions and features. This will help beginners understand the differences and switch between the two. The major points to be discussed in the article are listed below.

Table of contents

  1. About Pandas and R
  2. Comparing the data operations
  3. R Vs Pandas for Data Manipulation

About Pandas and R

Let’s have a brief introduction to both R and Pandas.

The R Programming Language

We can think of R as an implementation of the S language, a language and environment designed specifically for statistical and graphical analysis of data. Using R we can apply a variety of statistical techniques such as linear and nonlinear modelling, statistical testing, clustering and classification. The language also provides features for graphical analysis, so we can produce detailed and, with suitable packages, interactive plots of our data.

In this article, we focus on the tools and packages of R that can be used for data manipulation.

About Pandas

Pandas is a Python library for data-related tasks such as data manipulation and conversion. It works with data in tabular form. Alongside these tasks, we can also run SQL-style queries on the data using the separate pandasql package. Pandas functions can also be used to inspect data as it moves into or out of a process.

From the above points we can say that Pandas is a library within Python, whereas R is a language in itself with many packages for data-related tasks. In this article, we compare the R language and the Pandas library on such tasks.

Let’s start the comparison.

Comparing the data operations

As data science practitioners, we regularly use both Python and R to perform data-related tasks. In this section we will see how the same operations can be performed with the toolkits of R and with the Pandas library in Python.

In R, we mainly use the dplyr package for querying, filtering, and sampling operations. The table below shows the methods used for these simple operations in R and in Pandas.

| R | Pandas |
|---|---|
| dim(data) | data.shape |
| head(data) | data.head() |
| slice(data, 1:10) | data.iloc[:10] |
| filter(data, col1 == 1, col2 == 1) | data.query('col1 == 1 & col2 == 1') |
| data[data$col1 == 1 & data$col2 == 1,] | data[(data.col1 == 1) & (data.col2 == 1)] |
| select(data, col1, col2) | data[['col1', 'col2']] |
| select(data, col1:col3) | data.loc[:, 'col1':'col3'] |
| distinct(select(data, col1)) | data[['col1']].drop_duplicates() |
| select(data, -(col1:col3)) | data.drop(cols_to_drop, axis=1) |
| distinct(select(data, col1, col2)) | data[['col1', 'col2']].drop_duplicates() |
| sample_n(data, 10) | data.sample(n=10) |
| sample_frac(data, 0.01) | data.sample(frac=0.01) |

Here cols_to_drop stands for a list of the column names to remove.
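To make the table concrete, here is a minimal sketch of several of these Pandas calls on a small made-up frame. The data and the names col1, col2 and col3 are assumptions for illustration only, and random_state is passed just to make the sample repeatable.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
data = pd.DataFrame({
    "col1": [1, 1, 2, 2, 1, 3, 1, 2, 3, 1, 2, 1],
    "col2": [1, 2, 1, 1, 1, 3, 2, 1, 1, 1, 2, 1],
    "col3": range(12),
})

n_rows, n_cols = data.shape      # dim(data)
first_ten = data.iloc[:10]       # slice(data, 1:10): the first ten rows

# filter(data, col1 == 1, col2 == 1), written in two equivalent ways
by_query = data.query("col1 == 1 & col2 == 1")
by_mask = data[(data["col1"] == 1) & (data["col2"] == 1)]

# distinct(select(data, col1, col2))
unique_pairs = data[["col1", "col2"]].drop_duplicates()

# select(data, -(col1:col2)): build the list of columns to drop explicitly
cols_to_drop = list(data.loc[:, "col1":"col2"].columns)
remaining = data.drop(cols_to_drop, axis=1)

# sample_n(data, 5); random_state only makes the draw repeatable
sampled = data.sample(n=5, random_state=0)
```

Note that the label slice 'col1':'col2' in loc includes both endpoints, while the positional slice in iloc excludes its end.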

Next, let’s compare R (dplyr) and Pandas on sorting operations.

| R | Pandas |
|---|---|
| arrange(data, col1, col2) | data.sort_values(['col1', 'col2']) |
| arrange(data, desc(col1)) | data.sort_values('col1', ascending=False) |
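A short sketch of the sorting calls on made-up data follows. The last line, which mixes ascending and descending keys, is an addition not shown in the table; it relies on the fact that ascending also accepts one flag per sort column.

```python
import pandas as pd

# Hypothetical toy data for illustration.
data = pd.DataFrame({"col1": [2, 1, 2, 1], "col2": [4, 3, 1, 2]})

# arrange(data, col1, col2)
asc = data.sort_values(["col1", "col2"])

# arrange(data, desc(col1))
desc = data.sort_values("col1", ascending=False)

# arrange(data, col1, desc(col2)): one ascending flag per sort key
mixed = data.sort_values(["col1", "col2"], ascending=[True, False])
```

sort_values returns a new frame and leaves data unchanged, much as arrange does in dplyr.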

Next, let’s compare R (dplyr) and Pandas on transforming operations.

| R | Pandas |
|---|---|
| select(data, col_one = col1) | data.rename(columns={'col1': 'col_one'})['col_one'] |
| mutate(data, c=a-b) | data.assign(c=data['a']-data['b']) |
| rename(data, col_one = col1) | data.rename(columns={'col1': 'col_one'}) |
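The transforming calls can be sketched as below, again on made-up data. The chained version using a lambda is an assumption about how one might combine the two steps, not something taken from the table.

```python
import pandas as pd

# Hypothetical toy data for illustration.
data = pd.DataFrame({"a": [5, 7, 9], "b": [1, 2, 3], "col1": ["x", "y", "z"]})

# rename(data, col_one = col1)
renamed = data.rename(columns={"col1": "col_one"})

# mutate(data, c = a - b)
with_c = data.assign(c=data["a"] - data["b"])

# Chaining both steps; the lambda receives the frame being built
chained = (
    data.assign(c=lambda d: d["a"] - d["b"])
        .rename(columns={"col1": "col_one"})
)
```

Both rename and assign return new frames, so the original data still has its col1 column afterwards.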

Finally, let’s compare R (dplyr) and Pandas on group-by and summary operations.

| R | Pandas |
|---|---|
| summary(data) | data.describe() |
| gdata <- group_by(data, col1) | gdata = data.groupby('col1') |
| summarise(gdata, avg=mean(col1, na.rm=TRUE)) | data.groupby('col1').agg({'col1': 'mean'}) |
| summarise(gdata, total=sum(col1)) | data.groupby('col1').sum() |
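As a rough illustration of grouped summaries, the sketch below groups by one column and summarises a different one. The column val and the output names avg and total are hypothetical; the named-aggregation form of agg is one way to get summarise-style output columns.

```python
import pandas as pd

# Hypothetical toy data; "val" holds the values being summarised.
data = pd.DataFrame({
    "col1": ["x", "y", "x", "y"],
    "val": [1.0, 2.0, 3.0, None],
})

gdata = data.groupby("col1")

# Named aggregation: output_name=(input_column, function)
summary = gdata.agg(avg=("val", "mean"), total=("val", "sum"))
```

The Pandas mean and sum skip missing values by default, which corresponds to passing na.rm=TRUE in R.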

Slicing 

In R, we can select columns by passing a vector built with the c() function, and Pandas lets us do the same in Python. For example, the R code below selects columns by name or by integer location.

Using column name

data <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
data[, c("a", "c", "e")]

Using integer location

data <- data.frame(matrix(rnorm(1000), ncol=100))
data[, c(1:10, 25:30, 40, 50:100)]

In Pandas, we can do the same operation using the following lines of code.

import pandas as pd
import numpy as np
# Column labels "a", "b" and "c"
columns = list("abc")
data = pd.DataFrame(np.random.randn(5, 3), columns=columns)
data

Using column name

data[["a", "c"]]

Using the label-based loc indexer

data.loc[:, ["a", "c"]]

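The loc indexer selects by label. To select by integer position, as in the R example with c(1:10, 25:30, 40, 50:100), one option is the iloc indexer combined with numpy.r_. The sketch below assumes a frame with 100 columns and shifts the R positions to Python's 0-based, end-exclusive slices.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 100 columns, labelled 0..99 by default.
data = pd.DataFrame(np.random.randn(5, 100))

# np.r_ concatenates slices and scalars into one integer array.
# R's 1:10 is 1-based and inclusive, so it becomes 0:10 here.
positions = np.r_[0:10, 24:30, 39, 49:100]
subset = data.iloc[:, positions]
```

The result has the same 68 columns that the R expression selects.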

Aggregation 

In R, we can split the data into subsets by the columns by1 and by2 and calculate the mean of each subset with the aggregate function, as follows:

data <- data.frame(
  by1 = c("abc", "bdc", 1, 2, "abc", "bcd", 1, 2, "rfg", 1, "abc", 12),
  by2 = c("bac","cbd",99,95,"bac","xyz",95,99,"abc",99,"abc","abc"),
  v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
  v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99))
aggregate(x=data[, c("v1", "v2")], by=list(data$by1, data$by2), FUN = mean)

In Pandas, we can perform the same operation as follows:

data = pd.DataFrame(
    {
        "by1": ["abc", "bdc", 1, 2, "abc", "bcd", 1, 2, "rfg", 1, 'abc', 12],
        "by2": ["bac","cbd",99,95,"bac","xyz",95,99,"abc",99,"abc",'abc',],
        "v1": [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
        "v2": [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
    }
)
 
data

g = data.groupby(["by1", "by2"])
g[["v1", "v2"]].mean()

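By default the group keys end up in the index of the result. If a flat table with the keys as ordinary columns is preferred, closer in shape to what aggregate returns in R, passing as_index=False is one option. The sketch below uses a smaller made-up frame with string keys only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two grouping columns and one value column.
data = pd.DataFrame({
    "by1": ["abc", "abc", "bcd", "bcd"],
    "by2": ["bac", "bac", "xyz", "xyz"],
    "v1": [1.0, 8.0, 3.0, np.nan],
})

# Default: the group keys become a MultiIndex on the result
indexed = data.groupby(["by1", "by2"])[["v1"]].mean()

# as_index=False keeps the keys as ordinary columns
flat = data.groupby(["by1", "by2"], as_index=False)[["v1"]].mean()
```

In both results the missing value is skipped, so the mean for the second group is computed from its single remaining value.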

Matching function

In R, we can test which values belong to a set using the %in% operator, which is defined in terms of the match function:

s <- 0:9
s %in% c(4,6)

In Pandas, the isin() method does the same:

s = pd.Series(np.arange(10), dtype=np.float32)
s.isin([4, 6])

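isin() returns a boolean Series, so it is usually combined with indexing. The following sketch shows how that mask might be used to keep or exclude the matching values; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))

mask = s.isin([4, 6])    # s %in% c(4, 6)
selected = s[mask]       # s[s %in% c(4, 6)]
excluded = s[~mask]      # s[!(s %in% c(4, 6))]
```

The ~ operator negates the mask element by element, playing the role of ! in R.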

Query function

In R, we can use the subset function or bracket indexing to run conditional queries on a data set. The code below shows both forms.

data <- data.frame(a=rnorm(15), b=rnorm(15))
subset(data, a >= b)
data[data$a >= data$b,]

Here we extract the rows where the value of column a is greater than or equal to the value of column b.

In Pandas, we can perform this operation using the query method.

data = pd.DataFrame({"a": np.random.randn(15), "b": np.random.randn(15)})
data.query("a >= b")

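Just as R offers both subset and bracket indexing, the query string has a boolean-mask equivalent in Pandas. The sketch below checks that the two agree and also shows the @ prefix for referring to a Python variable inside a query. The seed and the threshold value are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
data = pd.DataFrame({
    "a": rng.standard_normal(15),
    "b": rng.standard_normal(15),
})

by_query = data.query("a >= b")          # subset(data, a >= b)
by_mask = data[data["a"] >= data["b"]]   # data[data$a >= data$b, ]

threshold = 0.5                          # hypothetical cutoff
above = data.query("a >= @threshold")    # @ refers to a Python variable
```

Both forms return the same rows; query is often easier to read when the condition involves several columns.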

R Vs Pandas for Data Manipulation

In the sections above, we discussed how to perform various data manipulation tasks using Pandas in Python and the toolkits of R. We found that in R the functionality is spread across several packages, which we are required to install separately on our local machine. When we use Pandas for similar purposes, the functions are available in a single place and we do not need to look for other tools. On the other hand, R is often credited with good speed for data analytics and with an interface that many users find friendlier than Pandas, and it can feel less complex than Python for such work, although this depends on the task and on the user's background. Both R and Pandas are strong in their own settings.

Final words 

In this article, we compared R and Pandas. In conclusion, R is a programming language whereas Pandas is a library: in R we perform data operations through its packages, while in Python Pandas provides comparable operations in one library. This tutorial should help beginners understand the difference between the two and migrate between them more easily.

Yugesh Verma
Yugesh is a graduate in automobile engineering and worked as a data analyst intern. He completed several Data Science projects. He has a strong interest in Deep Learning and writing blogs on data science and machine learning.