Data Preprocessing With R: Hands-On Tutorial

When it comes to Machine Learning and Artificial Intelligence, there are only a few top-performing programming languages to choose from. In the previous tutorial, we learned how to do Data Preprocessing in Python. Since R is also among the top performers in Data Science, in this tutorial we will learn to perform Data Preprocessing tasks with R.

(Note: The following tutorial will require basic programming knowledge of R.)

In this tutorial, we will learn to perform the following operations on a raw dataset:

  • Dealing with missing data
  • Dealing with categorical data
  • Splitting the dataset into training and testing sets
  • Scaling the features

Data Preprocessing in R

The following steps are crucial:

Importing The Dataset

dataset = read.csv('dataset.csv')

This is a simple dataset consisting of four columns. The dependent factor is the ‘purchased_item’ column. If the dataset is to be used for machine learning, the idea will be to predict whether an item was purchased or not depending on the nation, age and salary of a person. Cells with the value ‘NA’ denote missing values in the dataset.
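Since the CSV file itself is not reproduced here, the structure described above can be mimicked with a small hand-made data frame. This is only a sketch: the values are invented for illustration, and only the column names (nation, age, salary, purchased_item) follow the tutorial.

```r
# Hypothetical stand-in for dataset.csv: the values are made up,
# only the column names follow the tutorial text.
dataset <- data.frame(
  nation = c("India", "Germany", "Russia", "India", "Germany", "Russia"),
  age = c(25, NA, 41, 33, 29, 52),
  salary = c(40000, 52000, NA, 61000, 48000, 75000),
  purchased_item = c("No", "Yes", "No", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Show the column types and a preview of each column.
str(dataset)

# Count the NA cells in every column.
missing_per_column <- colSums(is.na(dataset))
print(missing_per_column)
```

Checking the NA counts before preprocessing makes it clear which columns need the missing-value treatment.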

Dealing With Missing Values

dataset$age = ifelse(is.na(dataset$age), ave(dataset$age, FUN = function(x) mean(x, na.rm = TRUE)), dataset$age)

dataset$salary = ifelse(is.na(dataset$salary), ave(dataset$salary, FUN = function(x) mean(x, na.rm = TRUE)), dataset$salary)

The above code blocks check for missing values in the age and salary columns and update the missing cells with the column-wise average.

  • dataset$column_header: Selects the column in the dataset specified after $ (age and salary).
  • is.na(dataset$column_header): This function returns TRUE for every cell in the specified column that has no value.
  • ave(dataset$column_header, FUN = function(x) mean(x, na.rm = TRUE)): This function calculates the average of the column passed as argument, ignoring missing values, and returns that average once for every row.
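The three pieces described above can be tried on a single toy column. This is a minimal sketch with made-up numbers; the variable names age, column_mean and imputed_age are chosen for illustration only.

```r
# Toy column with two missing values (made-up numbers).
age <- c(20, NA, 40, 30, NA)

# mean() with na.rm = TRUE ignores the NA entries: (20 + 40 + 30) / 3.
column_mean <- mean(age, na.rm = TRUE)

# ave() with no grouping variable returns that mean repeated for every row,
# so ifelse() can pick it wherever is.na() is TRUE and keep the
# original value everywhere else.
imputed_age <- ifelse(is.na(age),
                      ave(age, FUN = function(x) mean(x, na.rm = TRUE)),
                      age)
print(imputed_age)
```

Only the NA positions change; the observed values are left as they were.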


dataset$age = round(dataset$age, 0)

Since we are not interested in having decimal places for age, we round it to the nearest whole number using the above code. The argument 0 in the round function means no decimal places.

After executing the above code blocks, the age and salary columns no longer contain missing values, and age holds whole numbers.

Note :

  • Unlike Python, where we use NumPy arrays to store the data, in R we perform our operations directly on the dataset, which is a data frame.
  • We do not need to separate the dependent and independent factors explicitly, since R's modelling functions take a formula argument that identifies the dependent and independent factors in a dataset.
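To illustrate the second note, here is a minimal sketch of what such a formula looks like. The column names follow the tutorial; the object names purchase_formula, variables and short_formula are chosen for illustration, and no model is fitted here.

```r
# A formula names the dependent factor on the left of ~ and the
# independent factors on the right.
purchase_formula <- purchased_item ~ nation + age + salary

# all.vars() lists every variable used in the formula, dependent one first.
variables <- all.vars(purchase_formula)
print(variables)

# The shorthand "." stands for all other columns of the data frame
# that is later passed to the modelling function.
short_formula <- purchased_item ~ .
print(short_formula)
```

A modelling function that receives a formula like this together with the data frame can pick out the columns on its own.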

Dealing With Categorical Data

Categorical variables represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, educational level etc.

In our dataset, we have two categorical features, nation and purchased_item. In R we can use the factor function to convert text values into numerical codes.

dataset$nation = factor(dataset$nation, levels = c('India','Germany','Russia'), labels = c(1,2,3))

dataset$purchased_item = factor(dataset$purchased_item, levels = c('No','Yes'),  labels = c(0,1))

  • factor(dataset$column_header, levels = c(), labels = c()): The factor function converts the categorical values in the specified column to factors with the given labels.
  • levels: The categories in the column, passed as a vector. Example: c('India','Germany','Russia')
  • labels: The numerical codes for the specified categories, in the same order. Example: c(1,2,3)
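The effect of levels and labels is easiest to see on a short vector. The sketch below uses made-up values; note that the result is still a factor whose labels are stored as text, so an explicit conversion is needed if real numbers are required.

```r
# Made-up nation values in arbitrary order.
nation <- c("Germany", "India", "Russia", "India")

# Map India -> 1, Germany -> 2, Russia -> 3, as in the tutorial.
nation_factor <- factor(nation,
                        levels = c("India", "Germany", "Russia"),
                        labels = c(1, 2, 3))
print(nation_factor)

# The labels are stored as the strings "1", "2" and "3".
print(levels(nation_factor))

# To obtain a numeric vector, convert through character first.
nation_codes <- as.numeric(as.character(nation_factor))
print(nation_codes)
```

Any value that is not listed in levels would become NA, so the levels vector should cover every category present in the column.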


Splitting The Dataset Into Training And Testing Sets

We will use the caTools library in R to split our dataset into training_set and test_set.

install.packages('caTools') # install once
library(caTools) # importing caTools library
set.seed(123) # fix the random seed so the split is reproducible (123 is an arbitrary choice)
split = sample.split(dataset$purchased_item, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

  • set.seed(): Fixes the state of the random number generator so that the split is reproducible, i.e. the same seed value always gives the same split. It is similar to the random_state argument in Python's scikit-learn.
  • sample.split(dataset$dependent_factor, SplitRatio = 0.8): This function returns a vector of logical values with the same length as the dataset, with TRUE and FALSE assigned according to the specified SplitRatio. A ratio of 0.8 gives 80 percent TRUE and 20 percent FALSE values. For example, the above code block may assign the variable split the values [TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE]
  • subset(dataset, split == TRUE): This function returns the rows of the dataset where split is TRUE (80 percent of the original dataset for the given code).
  • subset(dataset, split == FALSE): This function returns the rows of the dataset where split is FALSE (20 percent of the original dataset for the given code).
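To see what such a logical split vector does without installing anything, the idea can be sketched in base R. This is not caTools' implementation: stratified_split is a hypothetical helper written for this illustration, the labels are made up, and the seed 42 is arbitrary.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Made-up labels: 10 "No" followed by 10 "Yes".
purchased_item <- factor(rep(c("No", "Yes"), each = 10))

# Hypothetical helper: marks about `ratio` of the rows of each class as TRUE,
# so both classes keep the same proportion in the training set.
stratified_split <- function(labels, ratio) {
  keep <- logical(length(labels))
  for (cls in unique(as.character(labels))) {
    idx <- which(labels == cls)
    n_train <- round(length(idx) * ratio)
    chosen <- idx[sample.int(length(idx), n_train)]
    keep[chosen] <- TRUE
  }
  keep
}

split <- stratified_split(purchased_item, 0.8)

# Cross-tabulate class against split membership.
print(table(purchased_item, split))
```

The resulting vector can be used with subset() exactly as in the tutorial code, with TRUE rows forming the training set and FALSE rows the test set.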

Scaling The Features

training_set[, c('age', 'salary')] = scale(training_set[, c('age', 'salary')])
test_set[, c('age', 'salary')] = scale(test_set[, c('age', 'salary')])

The scale function in R can be used to scale the features in the dataset: by default it centres each column to mean 0 and divides it by its standard deviation. Here we are only scaling the non-factors, which are age and salary.
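The tutorial scales each set on its own. A common alternative, sketched here as an assumption rather than as the tutorial's method, is to reuse the training set's means and standard deviations for the test set, so that both sets go through the same transformation. The data values below are made up.

```r
# Made-up numeric features for a training set and a test set.
training_set <- data.frame(age = c(20, 30, 40),
                           salary = c(30000, 50000, 70000))
test_set <- data.frame(age = c(25, 35),
                       salary = c(40000, 60000))

# Standardise the training features.
scaled_train <- scale(training_set)

# scale() stores the column means and standard deviations as attributes.
train_center <- attr(scaled_train, "scaled:center")
train_scale <- attr(scaled_train, "scaled:scale")

# Apply the training statistics to the test features.
scaled_test <- scale(test_set, center = train_center, scale = train_scale)

print(scaled_train)
print(scaled_test)
```

With this approach no information about the test set's distribution is used when transforming it.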



