Machine learning is now used in a wide range of applications, and people are working out how to apply it in their own domains. A common obstacle is a lack of data: there is often too little to build a reliable predictive model. Models built on such small datasets also tend to overfit and perform poorly on new data. What can we do in these situations? How do we build a model on a dataset that has only 100-200 rows?
In this article, we will explore ways to tackle this problem and build a model even on a small dataset. We will also see how to deal with overfitting. For this experiment, we will use the Iris dataset, in which each flower has to be classified into one of three species. The dataset is publicly available on Kaggle for download.
What will we learn from this article?
- How to build a machine learning model over a small dataset?
- What is Overfitting and how to overcome it? What are the different ways?
So let us begin our experiment.
- How to Build a Machine Learning Model over a Small Dataset?
Let us first import the required libraries, load the data and explore the dataset. Use the code below.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
# Load the Iris data and print its size as (rows, columns)
df = pd.read_csv('iris.csv')
print(df.shape)
Now we will encode the class column using a label encoder and visualize the pair plot using the seaborn library. Use the code below.
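le = LabelEncoder()
df['Class'] = le.fit_transform(df['Class'])  # map the species names to integers
sns.pairplot(df, hue='Class')  # scatter plot of every feature pair, coloured by class
plt.show()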
The pair plot helps us understand the relationship between every pair of columns and between each column and the target. It also gives an idea of which features are strong predictors of the target. Now we will divide the data into the independent features X and the dependent feature y. After defining these, we will split the dataset into training and testing sets. Refer to the code below.
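# Separate the features (X) from the target (y)
X = df.drop('Class', axis=1)
y = df['Class']
# Hold out 33% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)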
There are 100 rows in the training set and 50 rows in the testing set. We will now define two different models, Logistic Regression and K-Nearest Neighbors (KNN), and fit them on the training data. Use the code below.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
# Both models use their default hyperparameters
lr = LogisticRegression()
knn = KNeighborsClassifier()
lr.fit(X_train, y_train)
knn.fit(X_train, y_train)
# Accuracy on the same rows the models were trained on
print("Training Accuracy of KNN: ", knn.score(X_train, y_train))
print("Training Accuracy of LogisticRegression: ", lr.score(X_train, y_train))
We have now trained the two models. Next, we will compute predictions for the 50 rows of the testing data. Refer to the code below.
y_lr = lr.predict(X_test)
y_knn = knn.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
print("Testing Accuracy KNN : ", accuracy_score(y_test, y_knn))
print("Testing Accuracy LogisticRegression: ", accuracy_score(y_test, y_lr))
Comparing the training and testing accuracy, we can see that the models are overfitted, which means they will not perform well when exposed to production data.
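With only 150 rows, a single train/test split can give a noisy picture of how much a model overfits. As a hedged illustration (this is not part of the original experiment), the sketch below uses 5-fold stratified cross-validation to average the training and validation accuracy of the same two models. It assumes scikit-learn's bundled Iris data (`load_iris`) in place of `iris.csv`; the number of folds and `max_iter=1000` are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled Iris data stands in for the Kaggle iris.csv.
X, y = load_iris(return_X_y=True)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
}

# Stratified folds keep the three classes balanced in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

gaps = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=cv, scoring="accuracy", return_train_score=True
    )
    train_acc = scores["train_score"].mean()
    val_acc = scores["test_score"].mean()
    # Gap between average training and validation accuracy across the folds.
    gaps[name] = train_acc - val_acc
    print(f"{name}: train={train_acc:.3f} validation={val_acc:.3f} gap={gaps[name]:.3f}")
```

A large positive gap suggests overfitting, while a gap close to zero suggests the model generalizes about as well as it fits the training data.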
Now we will see how we can overcome this situation.
- What is Overfitting and how to overcome it? What are the different ways?
Overfitted models are models that perform very well on the training data but not so well on the testing data. Several approaches can be used to reduce overfitting. One is regularization, with techniques such as Ridge and Lasso. Read the article “Hands-On Implementation of Lasso and Ridge Regression” to learn how to use regularization. Another is to look for outliers in the data and remove them if they are present. Let us now check the descriptive statistics and some box plots of the dataset. Use the code below.
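To make the regularization idea concrete before moving on, here is a minimal sketch, assuming scikit-learn's bundled Iris data (`load_iris`) rather than `iris.csv`. scikit-learn's `LogisticRegression` applies an L2 (Ridge-style) penalty by default, and its `C` parameter is the inverse of the regularization strength. The values of `C`, the feature scaling step and `max_iter=1000` are illustrative choices, not settings from this article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for the Kaggle iris.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

results = []
for C in (0.01, 0.1, 1.0, 10.0):
    # Smaller C means a stronger L2 penalty on the coefficients.
    # Scaling the features first lets the penalty treat them on the same footing.
    clf = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    clf.fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)
    test_acc = clf.score(X_test, y_test)
    results.append((C, train_acc, test_acc))
    print(f"C={C}: train={train_acc:.3f} test={test_acc:.3f}")
```

Comparing the training and testing accuracy for each value of `C` shows how the strength of the penalty affects the gap between the two. The descriptive statistics and box plots introduced above are produced by the code that follows.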
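# Summary statistics (count, mean, std, min, quartiles, max) for every column
print(df.describe())
# One box plot per feature, grouped by class (recent seaborn versions need keyword arguments)
for col in ['Sepal Length (in cm)', 'Sepal Width in (cm)', 'Petal length (in cm)', 'Petal width (in cm)']:
    sns.boxplot(x='Class', y=col, data=df)
    plt.show()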
The descriptive statistics show the mean, median, maximum and minimum values for each column in the dataset, whereas the box plots show the distribution of the data and help detect outliers. There are not many extreme values in this dataset. If there were many such outliers, we could treat them. See the article “Outlier Detection Using Z score”, where I have explained outlier treatment.
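As a rough sketch of what outlier treatment with a Z score could look like, the code below drops rows in which any feature lies more than 3 standard deviations from its column mean. The threshold of 3 is a common convention rather than a value from this article, and scikit-learn's bundled Iris data (`load_iris`) is assumed in place of `iris.csv`, so the column names differ.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the Kaggle iris.csv.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "Class"})
features = df.drop(columns="Class")

# Z score: how many standard deviations each value lies from its column mean.
z_scores = (features - features.mean()) / features.std(ddof=0)

# Assumption: 3 is a conventional cut-off, not a value taken from this article.
threshold = 3.0
is_outlier = (z_scores.abs() > threshold).any(axis=1)

# Keep only the rows where no feature exceeds the threshold.
df_clean = df[~is_outlier]
print("Rows before:", len(df))
print("Rows flagged as outliers:", int(is_outlier.sum()))
print("Rows after:", len(df_clean))
```

On a small dataset every removed row is a noticeable loss of information, so it is worth inspecting the flagged rows before dropping them.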
In practice, tree-based algorithms are often observed to perform well in situations where other models get overfitted. We will now build two new models using tree-based algorithms, Decision Tree and Random Forest, and check the results. Use the code below.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Both models use their default hyperparameters
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
dt.fit(X_train, y_train)
rf.fit(X_train, y_train)
# Accuracy on the same rows the models were trained on
print("Training Accuracy of Decision Tree: ", dt.score(X_train, y_train))
print("Training Accuracy of Random Forest: ", rf.score(X_train, y_train))
y_dt = dt.predict(X_test)
y_rf = rf.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
print("Testing Accuracy of Decision Tree : ", accuracy_score(y_test, y_dt))
print("Testing Accuracy Random Forest: ", accuracy_score(y_test, y_rf))
Comparing the training and testing accuracy of the tree-based models, we can see that they do not get overfitted. This suggests that tree-based algorithms are very useful when you have a small amount of data. They help to limit overfitting and can also cope well with missing values and outliers. See the article “Practical Guide to Machine Learning Model Evaluation and Error Metrics”, which will help you evaluate a machine learning model using different error metrics.
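A fitted random forest also offers one way to look for the strong predictors mentioned earlier: its `feature_importances_` attribute scores each feature by how much it contributed to the splits. The sketch below is an illustration under assumptions (scikit-learn's bundled Iris data instead of `iris.csv`, 100 trees, a fixed `random_state`), not the exact models trained above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for the Kaggle iris.csv.
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Assumption: 100 trees and a fixed seed, chosen only to make the sketch reproducible.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Impurity-based importances, normalized so that they sum to 1.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Features near the top of the list are candidates to keep, and features near the bottom are candidates to drop when simplifying the model.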
In this article, we discussed how to build machine learning models on small datasets. We first explored Logistic Regression and KNN models and noticed the overfitting problem. Then we discussed ways to address this problem, such as regularization and outlier treatment. Finding the features that are strong predictors and keeping only those can also result in a good machine learning model. Finally, we built two more models using tree-based algorithms, compared their performance and found that they are really helpful when we are dealing with a small amount of data.
Also check this article “How Does Data Size Impact the Model Accuracy”.