
Data Preprocessing With R: Hands-On Tutorial

When it comes to Machine Learning and Artificial Intelligence, there are only a few top-performing programming languages to choose from. In the previous tutorial, we learned how to do data preprocessing in Python. Since R is also among the top performers in data science, in this tutorial we will learn to perform data preprocessing tasks with R.

(Note: The following tutorial will require basic programming knowledge of R.)



In this tutorial, we will learn to perform the following operations on a raw dataset:

  • Dealing with missing data
  • Dealing with categorical data
  • Splitting the dataset into training and testing sets
  • Scaling the features

Data Preprocessing in R

The following steps are crucial:

Importing The Dataset

dataset = read.csv('dataset.csv')

This is a simple dataset consisting of four columns. The dependent factor is the ‘purchased_item’ column. If the dataset is to be used for machine learning, the goal will be to predict whether an item was purchased or not, depending on the nation, age and salary of a person. Cells with the value ‘NA’ denote missing values in the dataset.
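Since dataset.csv itself is not reproduced here, the import step can be tried on a small stand-in. The sketch below is only an illustration: the values are invented, and the column order is an assumption chosen so that age and salary land in columns 3 and 4, as the scaling step later expects.

```r
# Hypothetical stand-in for dataset.csv; all values are invented for illustration.
csv_text <- "nation,purchased_item,age,salary
India,No,25,35000
Germany,Yes,NA,45000
Russia,No,32,NA
India,Yes,41,61000"

# read.csv() treats the string NA as a missing value by default.
dataset <- read.csv(text = csv_text)

str(dataset)             # column names and types
colSums(is.na(dataset))  # number of missing values per column
```

Running colSums(is.na(dataset)) before any cleaning is a quick way to see which columns need the treatment described in the next section.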

Dealing With Missing Values

dataset$age = ifelse(is.na(dataset$age), ave(dataset$age, FUN = function(x) mean(x, na.rm = TRUE)), dataset$age)

dataset$salary = ifelse(is.na(dataset$salary), ave(dataset$salary, FUN = function(x) mean(x, na.rm = TRUE)), dataset$salary)

The above code blocks check for missing values in the age and salary columns and update the missing cells with the column-wise average.

  • dataset$column_header: Selects the column in the dataset specified after $ (age and salary).
  • is.na(dataset$column_header): This function returns TRUE for every cell in the specified column that has no value.
  • ave(dataset$column_header, FUN = function(x) mean(x, na.rm = TRUE)): This function calculates the average of the column passed as an argument, ignoring missing values (na.rm = TRUE), and repeats it for every row.
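To see how the three pieces fit together, here is a minimal sketch on a toy data frame. The values are invented for illustration; only the function calls mirror the tutorial.

```r
# Toy data frame; values are invented for illustration.
dataset <- data.frame(age = c(20, NA, 40), salary = c(30000, 50000, NA))

# is.na() marks the missing cells; ave() with no grouping variable
# returns the overall mean repeated for every row.
is.na(dataset$age)                                         # FALSE TRUE FALSE
ave(dataset$age, FUN = function(x) mean(x, na.rm = TRUE))  # 30 30 30

# ifelse() keeps the observed values and fills only the missing ones.
dataset$age <- ifelse(is.na(dataset$age),
                      ave(dataset$age, FUN = function(x) mean(x, na.rm = TRUE)),
                      dataset$age)
dataset$salary <- ifelse(is.na(dataset$salary),
                         ave(dataset$salary, FUN = function(x) mean(x, na.rm = TRUE)),
                         dataset$salary)

dataset$age     # 20 30 40
dataset$salary  # 30000 50000 40000
```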


dataset$age = as.numeric(format(round(dataset$age, 0)))

Since we are not interested in having decimal places for age, we round it to the nearest whole number using the above code. The argument 0 in the round function means no decimal places.
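A quick check of what round does here, using invented ages. The sketch also suggests that the format and as.numeric round trip in the line above gives the same numbers as round alone.

```r
ages <- c(38.4, 38.6, 27.0)  # invented values

rounded <- round(ages, 0)
rounded          # 38 39 27: nearest whole number, not always upward
ceiling(ages)    # 39 39 27: always rounds up, shown for contrast

# The tutorial's round trip through format() yields the same values.
identical(as.numeric(format(round(ages, 0))), rounded)  # TRUE
```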

After executing the above code blocks, the age and salary columns no longer contain missing values, and age holds only whole numbers.

Note :

  • Unlike Python, where we use NumPy arrays to store the data to perform operations, in R we perform our operations directly on the dataset, which is a data frame (a list of equal-length columns).
  • We do not need to separate the dependent and independent factors explicitly, since R's modelling functions accept a formula that identifies the dependent and independent factors in a dataset.
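As a rough illustration of the second point, the sketch below builds a formula and asks R which columns it treats as dependent and independent. The data frame values are invented, and no model is fitted; model.frame and terms are used only to inspect the formula.

```r
# Toy data frame; values are invented for illustration.
dataset <- data.frame(
  nation         = c("India", "Germany", "Russia", "India"),
  purchased_item = c(0, 1, 0, 1),
  age            = c(25, 38, 32, 41),
  salary         = c(35000, 45000, 52000, 61000)
)

# Left of ~ is the dependent factor; "." means every other column.
f <- purchased_item ~ .

mf <- model.frame(f, data = dataset)
response   <- model.response(mf)                            # 0 1 0 1
predictors <- attr(terms(f, data = dataset), "term.labels")
predictors                                                  # "nation" "age" "salary"
```

Modelling functions such as lm or glm take the same kind of formula through their formula argument, which is why the columns do not have to be split by hand.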

Dealing With Categorical Data

Categorical variables represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, educational level etc.

In our dataset, we have two categorical columns: nation and purchased_item. In R, we can use the factor function to convert text values into numerical codes.

dataset$nation = factor(dataset$nation, levels = c('India','Germany','Russia'), labels = c(1,2,3))

dataset$purchased_item = factor(dataset$purchased_item, levels = c('No','Yes'),  labels = c(0,1))

  • factor(dataset$column_header, levels = c(), labels = c()): The factor function converts the categorical values in the specified column to factors, which store each category as a numerical code.
  • levels: The categories in the column, passed as a vector. Example: c('India','Germany','Russia')
  • labels: The numerical codes for the specified categories, in the same order. Example: c(1,2,3)
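The behaviour of levels and labels can be checked on a short vector. This is a minimal sketch with invented values; note that the result is still a factor, so the labels are category names rather than true numbers.

```r
nation <- c("India", "Russia", "Germany", "India")  # invented values

coded <- factor(nation,
                levels = c("India", "Germany", "Russia"),
                labels = c(1, 2, 3))

coded              # 1 3 2 1, with Levels: 1 2 3
class(coded)       # "factor": the codes are labels, not numbers
as.integer(coded)  # 1 3 2 1 as plain integers

# A value missing from `levels` becomes NA, so list every category.
unknown <- factor(c("India", "Japan"), levels = c("India", "Germany", "Russia"))
is.na(unknown)     # FALSE TRUE
```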


Splitting The Dataset Into Training And Testing Sets

We will use the caTools library in R to split our dataset into training_set and test_set.

install.packages('caTools') # install once
library(caTools) # importing caTools library
set.seed(123) # arbitrary seed value (an assumption); any fixed integer makes the split reproducible
split = sample.split(dataset$purchased_item, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

  • set.seed(): The seed function makes the split reproducible, i.e., the same seed value always produces the same split. It is similar to the random_state argument in Python.
  • sample.split(dataset$dependent_factor, SplitRatio = 0.8): This function returns a logical vector with one value per row of the original dataset, with TRUE and FALSE values in the specified SplitRatio. A ratio of 0.8 gives 80 percent TRUE and 20 percent FALSE values. For example, the above code block might assign the variable split the values [TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE]
  • subset(dataset, split == TRUE): This function returns the subset of the dataset passed as an argument where split is TRUE (80 percent of the original dataset with the given code).
  • subset(dataset, split == FALSE): This function returns the subset of the dataset passed as an argument where split is FALSE (20 percent of the original dataset with the given code).
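The effect of the seed and of the 80/20 ratio can be explored without installing anything. The sketch below uses base R only: stratified_split is a hypothetical helper written for this illustration, and it only approximates the idea of a split that respects class proportions. It is not how caTools implements sample.split.

```r
# Hypothetical helper: a base-R approximation of a stratified split.
# This is NOT the caTools implementation of sample.split.
stratified_split <- function(y, ratio = 0.8) {
  keep <- rep(FALSE, length(y))
  for (cls in unique(as.character(y))) {
    idx <- which(y == cls)
    n_train <- round(length(idx) * ratio)
    # Sample positions within idx, then mark those rows as training rows.
    keep[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  keep
}

# Toy dependent factor: 10 "No" and 5 "Yes" (invented).
purchased <- factor(rep(c("No", "Yes"), times = c(10, 5)))

set.seed(123)                 # arbitrary seed value
split_a <- stratified_split(purchased)
set.seed(123)                 # same seed again
split_b <- stratified_split(purchased)

identical(split_a, split_b)   # TRUE: same seed, same split
table(purchased, split_a)     # 8 of 10 "No" and 4 of 5 "Yes" are TRUE
```

Without the set.seed calls, the two splits would generally differ from run to run, which makes results harder to compare.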

Scaling The Features

training_set[,3:4] = scale(training_set[,3:4])
test_set[,3:4] = scale(test_set[,3:4])

The scale function in R can be used to scale the features in the dataset. By default it standardizes each column to zero mean and unit standard deviation. Here we are only scaling the non-factor columns, which are age and salary.
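The code above scales the training and test sets independently. A common alternative, shown here as a hedged sketch rather than as the tutorial's method, is to reuse the training set's means and standard deviations when scaling the test set. The values are invented for illustration.

```r
# Toy numeric features; values are invented for illustration.
training_set <- data.frame(age = c(20, 30, 40), salary = c(30000, 40000, 50000))
test_set     <- data.frame(age = c(30, 50),     salary = c(40000, 60000))

# scale() subtracts each column's mean and divides by its standard deviation.
scaled_train <- scale(training_set)
colMeans(scaled_train)       # 0 for both columns
apply(scaled_train, 2, sd)   # 1 for both columns

# The training means and standard deviations are stored as attributes ...
train_center <- attr(scaled_train, "scaled:center")
train_scale  <- attr(scaled_train, "scaled:scale")

# ... so the test set can be scaled with the same parameters.
scaled_test <- scale(test_set, center = train_center, scale = train_scale)
scaled_test                  # age: 0, 2; salary: 0, 2
```

Reusing the training parameters keeps both sets on the same scale, so a value of 2 means "two training standard deviations above the training mean" in either set.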





Amal Nair
A Computer Science Engineer turned Data Scientist who is passionate about AI and all related technologies. Contact:
