Data mining is the practice of extracting useful patterns and relationships from data, most commonly from databases, text, and the web. It uses statistical and pattern-matching techniques to help discover insights. The most common concerns with data are noise, missing values, relevance to the problem, size, and complexity. Real-world data is often vast, which brings the added challenges of imprecision and complex data structures. There are various approaches to handling such data, a few of which are popular among data engineers. In this article, we will discuss the most commonly used data handling, or feature engineering, techniques. The major points that we will cover are listed below.
Table of Contents
- How To Implement In Python?
Real-world data tends to be highly noisy, containing a large amount of meaningless and unwanted information, termed noise. Binning is a technique used to reduce the cardinality of continuous and discrete data: related values are grouped together into bins, which reduces the number of distinct values. Data binning, also known as bucketing, groups data into bins or buckets and replaces the values contained in a small interval with a representative value for that interval. Binning can improve the accuracy of models, especially predictive models. It creates a new categorical feature from the data, reducing noise or non-linearity in the dataset.
Binning also makes it easier to identify outliers and invalid or missing values in the numerical variables present in the data. The approach is often referred to as creating k bins, where k is the number of groups to which a numeric variable is mapped. It can be applied to each numeric input variable in the training dataset, and the binned variables are then provided as input to the machine learning model for the predictive modelling task. Supervised binning is a more intelligent form of binning in which important characteristics of the data are used to determine the bin boundaries.
Binning is also used for data smoothing. Here the data is first sorted, and the sorted values are then distributed into several buckets or bins. Because binning methods consult the neighboring values, this is also known as local smoothing. The most common approach is equal-width binning, which divides the range of a variable into k intervals of equal width. In equal-frequency binning, we instead divide the range into intervals that contain approximately the same number of points, although exactly equal frequencies may not be possible when values are repeated.
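The two binning strategies and smoothing by bin means can be sketched with pandas as follows. This is a minimal illustration, not the only way to do it; the sample values and the choice of k = 3 are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample values, already sorted, chosen only for illustration.
values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

k = 3  # number of bins (an assumption for this sketch)

# Equal-width binning: split the range [min, max] into k intervals of equal width.
equal_width = pd.cut(values, bins=k, labels=False)

# Equal-frequency binning: each bin receives roughly the same number of points.
# duplicates="drop" merges bin edges that coincide because of repeated values.
equal_freq = pd.qcut(values, q=k, labels=False, duplicates="drop")

# Smoothing by bin means: replace every value with the mean of its bin.
smoothed = values.groupby(equal_freq).transform("mean")

print(pd.DataFrame({
    "value": values,
    "equal_width_bin": equal_width,
    "equal_freq_bin": equal_freq,
    "smoothed": smoothed,
}))
```

With these values, the equal-width bins hold different numbers of points while the equal-frequency bins hold four points each, which is the practical difference between the two strategies.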
Data transformation, or transforming, in the context of data mining is done by combining unstructured data with structured data for better analysis. It is also an important process when data is being moved to a new cloud data warehouse. Homogeneous, well-structured data is easier to analyze and search for patterns, and transformation is what makes the data that way. Sometimes the data present in databases has unique IDs, keys, and values. All of these need to be formatted consistently so that the records are comparable and can be correctly evaluated. Data transformation can also be defined more broadly as converting data from one format to another, or changing its present structure into another structure.
Data transformation is a critical step before other activities such as data integration and data management. It can include a wide range of operations, such as converting data types, cleansing data by removing nulls or duplicate values, enriching the data, or performing specific aggregations, depending on the needs. In a cloud data warehouse, the data is arranged homogeneously to make it easier to recognize patterns. The data can be converted in multiple ways that suit mining. Data transformation also involves smoothing and aggregation techniques.
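A few of the transformation steps listed above (removing duplicates and nulls, converting data types, and aggregating) might look like this in pandas. The DataFrame, its column names, and its values are hypothetical and exist only to make the sketch runnable.

```python
import pandas as pd

# Hypothetical raw records: everything arrives as strings, with one
# duplicated row and one missing city.
raw = pd.DataFrame({
    "id": ["1", "2", "2", "3", "4"],
    "city": ["Pune", "Delhi", "Delhi", None, "Pune"],
    "amount": ["10.5", "20", "20", "7.25", "4.5"],
})

clean = (
    raw.drop_duplicates()                             # remove exact duplicate records
       .dropna(subset=["city"])                       # remove rows with a missing city
       .astype({"id": "int64", "amount": "float64"})  # convert strings to numeric types
)

# Aggregation: total amount per city.
totals = clean.groupby("city")["amount"].sum()

print(clean)
print(totals)
```

Whether to drop rows with missing values or fill them in instead depends on the dataset; dropping is used here only because it is the simplest option to show.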
Data collection, or aggregation, is the method of storing and presenting data in a summary format. This is a crucial step because the accuracy of the analysis and of the insights generated depends heavily on the quantity and quality of the data being used. Gathering accurate, high-quality data in large quantities is essential for producing relevant results. Smoothing, on the other hand, is the process of removing noise from the data using algorithms that help highlight its important features. It also helps in identifying the patterns present in the data correctly.
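One simple smoothing algorithm is a centered moving average, sketched below with pandas. The text does not name a specific algorithm, so the moving average, the window size of 3, and the sample readings are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical noisy readings with two spikes (30 and 29).
noisy = pd.Series([10, 12, 30, 11, 13, 12, 29, 11])

# Centered moving average over 3 neighboring points.
# min_periods=1 lets the first and last points use the neighbors they have.
smooth = noisy.rolling(window=3, center=True, min_periods=1).mean()

print(pd.DataFrame({"noisy": noisy, "smooth": smooth}))
```

A larger window removes more noise but also flattens genuine short-lived features, so the window size is a trade-off rather than a fixed rule.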
Feature scaling is a method used to normalize the range of the independent variables, or features, present in the data. We often observe that the ranges of data values vary widely. In some machine learning algorithms, objective functions do not work properly without normalization. Many classifiers, too, calculate the distance between two points using the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by that feature. The range of all features should therefore be normalized so that every feature contributes approximately proportionately to the final distance. If feature scaling is not performed, the machine learning model gives a higher weight to larger values and a lower weight to smaller values. Training the model might also take much longer.
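The effect on Euclidean distance is easy to see with two hypothetical points whose features are age (in years) and income. The numbers below are invented for the illustration.

```python
import numpy as np

# Two hypothetical people: [age in years, income]
a = np.array([25.0, 20000.0])
b = np.array([60.0, 20500.0])

diff = a - b
distance = np.linalg.norm(diff)  # Euclidean distance on the raw features

# Fraction of the squared distance that comes from the income feature alone.
income_share = diff[1] ** 2 / np.sum(diff ** 2)

print(f"distance = {distance:.2f}")
print(f"income share of squared distance = {income_share:.4f}")
```

Although the two people differ by 35 years in age, which is large on the age scale, the unscaled distance is determined almost entirely by the income difference.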
Hence, scaling methods are needed. Standardization is a very effective technique that rescales a feature so that its distribution has a mean of zero and a variance of 1. Min-Max Normalization is another technique, which rescales a feature so that its values lie between zero and one. The goal of scaling is to bring the features onto almost the same scale, so that each feature becomes equally important and the data is easier for most ML algorithms to process.
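Both rescaling methods can be written in a few lines of NumPy. This is a minimal sketch on a hypothetical feature matrix; the column meanings and values are assumptions.

```python
import numpy as np

# Hypothetical feature matrix: one row per sample,
# columns are [age in years, income].
X = np.array([
    [25.0, 20000.0],
    [35.0, 50000.0],
    [45.0, 80000.0],
])

# Standardization: subtract the column mean and divide by the column
# standard deviation, giving each feature zero mean and unit variance.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: rescale each column to the range [0, 1].
min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(standardized)
print(min_max)
```

Both formulas divide by a per-column spread, so a constant column (zero standard deviation or zero range) would need to be handled separately before applying them.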
Shuffling techniques aim to mix up data while helping to retain the logical relationships between data columns. Shuffling randomly reorders the values of an attribute, or a set of attributes, within a dataset. With this method, a sensitive value can be replaced by the value of the same attribute from a different record. It is used for masking confidential numerical data, since the values of the confidential variables are shuffled among the observed data points. The shuffled data may provide a high level of data utility while minimizing disclosure risk. Data shuffling helps overcome reservations about using modified confidential data because it retains the desirable properties of the original data and performs better than other masking techniques in both data utility and disclosure risk.
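The basic idea of moving confidential values between records can be illustrated with a plain random permutation of one column. Note that this is only the simplest form of shuffling, shown as an assumption-laden sketch: it does not by itself preserve relationships between the shuffled column and the other columns, which is what more careful rank-based shuffling procedures are designed to do. The names and salaries are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

# Hypothetical records with one confidential numeric attribute.
records = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dara", "Eli"],
    "salary": [52000, 61000, 48000, 75000, 58000],
})

masked = records.copy()
# Each record receives a salary value taken from some record in the dataset.
masked["salary"] = rng.permutation(records["salary"].to_numpy())

print(masked)
```

The set of salary values is unchanged, so column-level statistics such as the mean and the distribution are retained, while individual salaries are no longer tied to their original records.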
Data shuffling can be implemented using only rank-order data, and thus it provides a nonparametric method for masking. It is equally applicable to small and large datasets. In machine learning, we are required to split the dataset into training, validation, and testing sets. It is very important that the dataset is shuffled well before the split, so that the resulting sets do not carry any bias or ordering patterns into training. Shuffling can improve the quality and the predictive performance of the model being trained.
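Shuffling before a train/validation/test split can be sketched with NumPy as follows. The 60/20/20 proportions, the seed, and the toy arrays are assumptions made for the example; the important detail is that features and labels are reordered with the same permutation so that rows stay aligned.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the split is reproducible

n_samples = 10
# Hypothetical features and labels; row i of X belongs to label y[i].
X = np.arange(n_samples * 2).reshape(n_samples, 2)
y = np.arange(n_samples)

# One shared permutation keeps every feature row paired with its label.
perm = rng.permutation(n_samples)
X_shuffled, y_shuffled = X[perm], y[perm]

# 60% training, 20% validation, 20% testing.
n_train = int(0.6 * n_samples)
n_val = int(0.2 * n_samples)

X_train, y_train = X_shuffled[:n_train], y_shuffled[:n_train]
X_val, y_val = X_shuffled[n_train:n_train + n_val], y_shuffled[n_train:n_train + n_val]
X_test, y_test = X_shuffled[n_train + n_val:], y_shuffled[n_train + n_val:]

print(len(X_train), len(X_val), len(X_test))
```

For data with a meaningful order, such as time series, a random shuffle before splitting is usually not appropriate, so this sketch applies to datasets whose rows are independent.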
How To Implement In Python?
Here I am going to demonstrate an example of the binning technique as it can be implemented in Python during the data mining process. Assume that we have a large amount of data and that we cannot pass strings to a machine learning model. We therefore first need to convert the categorical features present in the dataset, such as Sex, Embarked, and others, into numeric values by mapping each category to an integer code. After that, the data binning technique can be very useful for the continuous columns.
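# Map each category label to an integer code.
# Assigning the result back avoids relying on inplace=True on a column selection.
data['Sex'] = data['Sex'].replace(['male', 'female'], [0, 1])
data['Embarked'] = data['Embarked'].replace(['S', 'C', 'Q'], [0, 1, 2])
data['Initial'] = data['Initial'].replace(['Mr', 'Mrs', 'Miss', 'Master', 'Other'], [0, 1, 2, 3, 4])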
The continuous Age column can then be converted into categories using the binning method:
# Bin Age into five categories of width 16
data['Age_cat'] = 0
data.loc[data['Age'] <= 16, 'Age_cat'] = 0
data.loc[(data['Age'] > 16) & (data['Age'] <= 32), 'Age_cat'] = 1
data.loc[(data['Age'] > 32) & (data['Age'] <= 48), 'Age_cat'] = 2
data.loc[(data['Age'] > 48) & (data['Age'] <= 64), 'Age_cat'] = 3
data.loc[data['Age'] > 64, 'Age_cat'] = 4
Similarly, other columns, such as Fare, can be converted into categorical features by using the binning method:
# Bin Fare into six categories using fixed thresholds
data['Fare_cat'] = 0
data.loc[data['Fare'] <= 7.775, 'Fare_cat'] = 0
data.loc[(data['Fare'] > 7.775) & (data['Fare'] <= 8.662), 'Fare_cat'] = 1
data.loc[(data['Fare'] > 8.662) & (data['Fare'] <= 14.454), 'Fare_cat'] = 2
data.loc[(data['Fare'] > 14.454) & (data['Fare'] <= 26.0), 'Fare_cat'] = 3
data.loc[(data['Fare'] > 26.0) & (data['Fare'] <= 52.369), 'Fare_cat'] = 4
data.loc[data['Fare'] > 52.369, 'Fare_cat'] = 5
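The same Age thresholds can also be expressed more compactly with `pd.cut`. The sketch below uses a small hypothetical DataFrame in place of the article's `data`, since the original dataset is not shown. One difference to be aware of: rows with a missing Age would receive NaN here, whereas the manual assignments leave them at the default category 0.

```python
import pandas as pd

# Hypothetical stand-in for the `data` DataFrame.
data = pd.DataFrame({"Age": [4, 16, 22, 32, 40, 48, 55, 64, 70]})

# Bin edges matching the manual thresholds: (-inf, 16], (16, 32], (32, 48],
# (48, 64], (64, inf). pd.cut includes the right edge by default.
age_edges = [-float("inf"), 16, 32, 48, 64, float("inf")]

# labels=False returns the integer bin index (0 to 4) instead of interval labels.
data["Age_cat"] = pd.cut(data["Age"], bins=age_edges, labels=False)

print(data)
```

Keeping the edges in a single list makes the thresholds easier to review and change than a sequence of separate conditions.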
Printing the Binned Table,
Using several data handling techniques is important because real-world data comes with several challenges. If the techniques and methods discussed here are used correctly and implemented properly, they can go a long way towards producing a correct model with high accuracy.
In this article, we discussed several methods that help tackle real-world data: binning, transforming, scaling, and shuffling. These methods make the process of data mining a lot easier and help generate better insights from the mined data. We also saw an example of the data binning technique and where it can be used. I would encourage the reader to try the other methods as well for a greater understanding.