Openrefine is a powerful tool that helps you clean messy data without having to explicitly code even a single line. It was previously known as google refine and can be used similar to using Excel to perform complex data transformations. The main advantage of this tool is that it is completely free, open-source and the data does not leave your computer, which means all the operations can be done locally.
The purpose of this article is to explore and experiment with the openrefine tool using a sample dataset. So, let’s begin.
When is this tool most useful?
The Openrefine tool is most effective for data migration projects. For datasets with columns containing address field, names etc which needs to be converted to a standard format before it is migrated and modelled, the openrefine tool works best. It is also very useful for columns that contain data with spelling errors, dates written in different formats, white spaces and special characters. These columns can be easily manipulated without writing functions to do so.
Openrefine tool walkthrough
Before we can upload a dataset and clean it, we need to download the tool first. To download the openrefine tool click here. You have options for downloading the tool based on the operating system. Download the zipped file, extract them and click on the openrefine application to launch the tool. You will see a new window with http://127.0.0.1:3333/ host address open.
As you can see you can upload data folders or project folders. The database used here is Postgresql for holding large datasets. The datasets can be in CSV, TSV, Excel, JSON and many other formats.
For this tutorial, I will make use of the Pokemon dataset from Kaggle. You can download the same data or use a dataset of your choice.
After uploading the dataset file in the space provided above, you will find an option to create a project as shown below.
You can set the name for your project and create a project.
Here, it shows the entire dataset and the rows in the dataset. I will show the manipulation techniques now.
You can edit the values in the data frame directly as shown below. By hovering over the row you will see an edit option. Clicking that will lead you to a pop up where you can enter the new name.
You also have the option to apply the edit to all the cells that are identical.
As you see I changed the name of the row and applied to identical rows as well.
It is now easy to identify how many categories of a particular data is available just by clicking on that column. I have clicked on the type column.
This shows that there are 18 unique categories and the number of times each of these categories have occurred in the dataset.
Let us now join some columns. To join columns you can click on the column and select the join option.
The following tab is displayed. You can select the columns that you want to join and also specify what needs to be done with null values. Here, I decided to join the type1 column with type2. I have also replaced null with the word ‘attack’.
Thus the columns are joined to one.
Like, join, you can also split one column into several columns.
Here you can select the criteria for splitting the column. I will say separate by the word grass.
As you can see the column is split into multiple columns. Think of this as analogous to one-hot encoding but for text data.
You can drop the columns as well. Since I have combined type1 and type2 I can drop the type2. To drop the column, click on edit column and then select remove column.
Doing this will remove the column from the dataset.
The data manipulation can also be done through each cell as well. The options available here are
You can convert the data types into numbers, strings, dates also. Finally, you can sort the entire dataset based on one single column.
These are just some of the data manipulation methods that I have used according to the requirements of my dataset. There are other options available that may suit your requirements. You can check the documentation for other manipulations as well.
Conclusion
In this article, we saw the different ways to manipulate the dataset without writing code using the openrefine tool. It is an effective way to locally change the data according to the user needs. Also, all cleaning actions are recorded as JSON script which helps in reproducibility and automation.