# Data Preprocessing With R: Hands-On Tutorial

When it comes to Machine Learning and Artificial Intelligence, there are only a few top-performing programming languages to choose from. In the previous tutorial, we learned how to do Data Preprocessing in Python. Since R is among the top performers in Data Science, in this tutorial we will learn to perform Data Preprocessing tasks with R.

(Note: The following tutorial will require basic programming knowledge of R.)

In this tutorial, we will learn to perform the following operations on a raw dataset:

• Dealing with missing data
• Dealing with categorical data
• Splitting the dataset into training and testing sets
• Scaling the features

### Data Preprocessing in R

The following steps are crucial:

#### Importing The Dataset

`dataset = read.csv('dataset.csv')`

This is a simple dataset consisting of four columns. The dependent variable is the ‘purchased_item’ column. If this dataset is used for machine learning, the idea will be to predict whether an item was purchased or not, depending on the country, age and salary of a person. Cells with the value ‘NA’ denote missing values in the dataset.
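Before changing anything, it helps to check the column types and count the missing cells. The sketch below is only an illustration: the `toy` data frame, its values and its column order are made up to stand in for `dataset.csv`, which is not reproduced here.

```r
# Hypothetical toy data standing in for dataset.csv (values and column order are assumptions)
toy <- data.frame(
  nation = c("India", "Germany", "Russia", "India", "Germany"),
  purchased_item = c("No", "Yes", "No", "Yes", "No"),
  age = c(25, 27, NA, 35, 40),
  salary = c(40000, NA, 50000, 60000, 52000)
)

# Round-trip through a temporary CSV so that read.csv() parses real NA cells
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
dataset <- read.csv(path)

# Column names and types
str(dataset)

# Number of missing values in each column
missing_per_column <- colSums(is.na(dataset))
print(missing_per_column)
```

Columns with a non-zero count are the ones that need the missing-value treatment.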

#### Dealing With Missing Values

`dataset$age = ifelse(is.na(dataset$age), ave(dataset$age, FUN = function(x) mean(x, na.rm = TRUE)), dataset$age)`

`dataset$salary = ifelse(is.na(dataset$salary), ave(dataset$salary, FUN = function(x) mean(x, na.rm = TRUE)), dataset$salary)`

The above code blocks check for missing values in the age and salary columns and update the missing cells with the column-wise average.

• dataset$column_header: Selects the column of the dataset named after $ (age and salary).
• is.na(dataset$column_header): Returns TRUE for every cell in the specified column that holds a missing value (NA).
• ave(dataset$column_header, FUN = function(x) mean(x, na.rm = TRUE)): Calculates the average of the column passed as the argument, ignoring missing values, and returns it repeated for every row.
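The three pieces above can be seen working together on a small vector. This is a minimal sketch with made-up ages; `age`, `column_mean` and `age_filled` are illustrative names, not part of the tutorial's dataset.

```r
# Hypothetical age values with one missing entry
age <- c(25, 27, NA, 35, 40)

# With no grouping variable, ave() applies FUN to the whole vector
# and repeats the result once per element
column_mean <- ave(age, FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values; substitute the mean where the value is missing
age_filled <- ifelse(is.na(age), column_mean, age)
print(age_filled)
```

The mean of the four observed values is 31.75, so only the third element changes.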

`dataset$age = as.numeric(format(round(dataset$age, 0)))`

Since we are not interested in decimal places for age, we round it to the nearest whole number using the above code. The argument 0 in the round function means no decimal places.
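Note that round gives the nearest whole number rather than always rounding upwards, and that R rounds exact halves to the even neighbour. A small sketch with made-up values (the vector `ages` is hypothetical) shows the difference from ceiling:

```r
# Hypothetical ages, including two exact halves
ages <- c(31.75, 38.2, 27.5, 28.5)

# Nearest whole number; exact halves go to the even neighbour
rounded <- round(ages, 0)
print(rounded)   # 32 38 28 28

# Always rounds upwards, shown for comparison
rounded_up <- ceiling(ages)
print(rounded_up)   # 32 39 28 29
```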

Note :

• Unlike Python, where we use NumPy arrays to store the data and perform operations, in R we operate directly on the dataset, which is a data frame (internally a list of columns).
• We do not need to separate the dependent and independent variables explicitly, since R's modelling functions accept a formula that identifies them within the dataset.
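To make the formula idea concrete, the sketch below builds a formula and asks R which variable is the response and which are the predictors. The `toy` data frame and the variable names `response`, `predictors` and `predictors_all` are illustrative assumptions; no model is fitted here.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(
  nation = c("India", "Germany", "Russia", "India"),
  age = c(25, 32, 41, 38),
  salary = c(40000, 52000, 61000, 58000),
  purchased_item = c("No", "Yes", "No", "Yes")
)

# Left of ~ is the dependent variable, right of ~ are the independent ones
f <- purchased_item ~ age + salary
response <- all.vars(f)[1]
predictors <- all.vars(f)[-1]
print(response)
print(predictors)

# A dot on the right-hand side means "every other column in the data"
predictors_all <- attr(terms(purchased_item ~ ., data = toy), "term.labels")
print(predictors_all)
```

A formula such as `f` is what would typically be passed, together with the data frame, to a modelling function.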

#### Dealing With Categorical Data

Categorical variables represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, educational level etc.

In our dataset, we have two categorical features, nation and purchased_item. In R we can use the factor function to convert text into numerical codes.

`dataset$nation = factor(dataset$nation, levels = c('India','Germany','Russia'), labels = c(1,2,3))`

`dataset$purchased_item = factor(dataset$purchased_item, levels = c('No','Yes'), labels = c(0,1))`

• factor(dataset$column_header, levels = c(), labels = c()): the factor function converts the categorical values in the specified column to factors, which R stores as numerical codes.
• levels: the categories in the column, passed as a vector. Example: c('India','Germany','Russia')
• labels: the numerical codes for the specified categories, in the same order. Example: c(1,2,3)
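One detail worth knowing is that the labels become the factor's level names (stored as text), while the internal integer codes always start at 1. The sketch below uses made-up vectors (`nation` and `purchased` are hypothetical) to show how to read back either one.

```r
# Hypothetical raw columns
nation <- c("India", "Germany", "Russia", "Germany", "India")
purchased <- c("No", "Yes", "No")

nation_f <- factor(nation, levels = c("India", "Germany", "Russia"), labels = c(1, 2, 3))
purchased_f <- factor(purchased, levels = c("No", "Yes"), labels = c(0, 1))

# Labels are stored as character level names
print(levels(purchased_f))   # "0" "1"

# as.integer() returns the internal codes, which start at 1
internal_codes <- as.integer(purchased_f)
print(internal_codes)        # 1 2 1

# Go through as.character() to recover the labels as numbers
label_values <- as.numeric(as.character(purchased_f))
print(label_values)          # 0 1 0
```

For nation the labels 1, 2, 3 happen to coincide with the internal codes, so the distinction only shows up with labels such as 0 and 1.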

#### Splitting The Dataset Into Training And Testing Sets

We will use the caTools library in R to split our dataset into training_set and test_set.

```
install.packages('caTools') # install once
library(caTools) # importing caTools library
set.seed(123)
split = sample.split(dataset$purchased_item, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
```

• set.seed(): Fixes the seed of R's random number generator so that the split is reproducible: running the code again with the same seed gives the same split. It plays a role similar to the random_state argument in Python's scikit-learn.
• sample.split(dataset$dependent_factor, SplitRatio = 0.8): Returns a logical vector with the same length as the dataset, in which about 80 percent of the values are TRUE and 20 percent are FALSE, keeping the proportion of each class roughly the same on both sides. For example, the above code block will assign the variable split with values [TRUE  TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE]
• subset(dataset, split == TRUE): Returns the rows of the dataset where split is TRUE (80 percent of the original dataset with the given code).
• subset(dataset, split == FALSE): Returns the rows of the dataset where split is FALSE (20 percent of the original dataset with the given code).
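The role of the seed and of the logical vector can be illustrated without installing anything. The sketch below is a base-R stand-in written for this explanation, not the caTools implementation: `stratified_split` is a hypothetical helper that marks a fixed share of each class as TRUE, and the labels are made up.

```r
# Hypothetical helper: marks `ratio` of each class as TRUE (training)
stratified_split <- function(labels, ratio, seed) {
  set.seed(seed)                      # same seed, same split
  in_train <- rep(FALSE, length(labels))
  for (lv in unique(labels)) {
    idx <- which(labels == lv)
    n_train <- round(ratio * length(idx))
    chosen <- idx[sample.int(length(idx), n_train)]
    in_train[chosen] <- TRUE
  }
  in_train
}

# Made-up labels: 10 "No" and 5 "Yes"
labels <- rep(c("No", "Yes"), times = c(10, 5))

split_a <- stratified_split(labels, ratio = 0.8, seed = 123)
split_b <- stratified_split(labels, ratio = 0.8, seed = 123)

print(sum(split_a))                   # number of training rows
print(table(labels[split_a]))         # class counts in the training part
print(identical(split_a, split_b))    # reproducible with the same seed
```

With 15 rows and a ratio of 0.8, 12 rows are marked TRUE (8 "No" and 4 "Yes"), and calling the helper again with the same seed returns the identical vector.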

#### Scaling The Features

```
training_set[,3:4] = scale(training_set[,3:4])
test_set[,3:4] = scale(test_set[,3:4])
```

The scale function in R can be used to scale the features in the dataset. By default it subtracts each column's mean and divides by its standard deviation. Here we scale only the non-factor columns, age and salary.
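The code above scales the training and test sets independently, each with its own mean and standard deviation. A commonly used alternative, shown here only as a sketch and not as this tutorial's method, is to reuse the training set's mean and standard deviation for the test set so that both are on the same scale. The `train` and `test` data frames below are made up for illustration.

```r
# Hypothetical numeric columns (values are made up)
train <- data.frame(age = c(25, 30, 35, 40), salary = c(40000, 50000, 60000, 70000))
test <- data.frame(age = c(28, 45), salary = c(45000, 80000))

# scale() records the column means and standard deviations it used
train_scaled <- scale(train)
centers <- attr(train_scaled, "scaled:center")
scales <- attr(train_scaled, "scaled:scale")

# Apply the training parameters to the test set
test_scaled <- scale(test, center = centers, scale = scales)

print(train_scaled)
print(test_scaled)
```

After this, each training column has mean 0 and standard deviation 1, while the test columns are expressed relative to the training distribution.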
