Openrefine Tutorial: A Tool For Data Preprocessing Without Code

The purpose of this article is to explore and experiment with the openrefine tool using a sample dataset.

Openrefine is a powerful tool that helps you clean messy data without having to explicitly code even a single line. It was previously known as google refine and can be used similar to using Excel to perform complex data transformations. The main advantage of this tool is that it is completely free, open-source and the data does not leave your computer, which means all the operations can be done locally. 

The purpose of this article is to explore and experiment with the openrefine tool using a sample dataset. So, let’s begin.

When is this tool most useful?

The Openrefine tool is most effective for data migration projects. For datasets with columns containing address field, names etc which needs to be converted to a standard format before it is migrated and modelled, the openrefine tool works best. It is also very useful for columns that contain data with spelling errors, dates written in different formats, white spaces and special characters. These columns can be easily manipulated without writing functions to do so. 

Openrefine tool walkthrough

Before we can upload a dataset and clean it, we need to download the tool first. To download the openrefine tool click here. You have options for downloading the tool based on the operating system. Download the zipped file, extract them and click on the openrefine application to launch the tool. You will see a new window with http://127.0.0.1:3333/ host address open. 

openrefine

As you can see you can upload data folders or project folders. The database used here is Postgresql for holding large datasets. The datasets can be in CSV, TSV, Excel, JSON and many other formats. 

 For this tutorial, I will make use of the Pokemon dataset from Kaggle. You can download the same data or use a dataset of your choice. 

After uploading the dataset file in the space provided above, you will find an option to create a project as shown below. 

You can set the name for your project and create a project. 

Here, it shows the entire dataset and the rows in the dataset. I will show the manipulation techniques now. 

You can edit the values in the data frame directly as shown below. By hovering over the row you will see an edit option. Clicking that will lead you to a pop up where you can enter the new name.

openrefine

You also have the option to apply the edit to all the cells that are identical. 

As you see I changed the name of the row and applied to identical rows as well. 

It is now easy to identify how many categories of a particular data is available just by clicking on that column. I have clicked on the type column. 

data transformation

This shows that there are 18 unique categories and the number of times each of these categories have occurred in the dataset. 

Let us now join some columns. To join columns you can click on the column and select the join option. 

openrefine

The following tab is displayed. You can select the columns that you want to join and also specify what needs to be done with null values. Here, I decided to join the type1 column with type2. I have also replaced null with the word ‘attack’.

Thus the columns are joined to one. 

Like, join, you can also split one column into several columns.

data transformation

Here you can select the criteria for splitting the column. I will say separate by the word grass. 

As you can see the column is split into multiple columns. Think of this as analogous to one-hot encoding but for text data. 

You can drop the columns as well. Since I have combined type1 and type2 I can drop the type2. To drop the column, click on edit column and then select remove column. 

Doing this will remove the column from the dataset. 

The data manipulation can also be done through each cell as well. The options available here are

You can convert the data types into numbers, strings, dates also. Finally, you can sort the entire dataset based on one single column. 

openrefine

These are just some of the data manipulation methods that I have used according to the requirements of my dataset. There are other options available that may suit your requirements. You can check the documentation for other manipulations as well. 

Conclusion

In this article, we saw the different ways to manipulate the dataset without writing code using the openrefine tool. It is an effective way to locally change the data according to the user needs. Also, all cleaning actions are recorded as JSON script which helps in reproducibility and automation.

Download our Mobile App

Subscribe to our newsletter

Join our editors every weekday evening as they steer you through the most significant news of the day.
Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.

Our Recent Stories

Our Upcoming Events

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox
MOST POPULAR