
How To Implement ML Models With Small Datasets

In this article, we explore ways to tackle overfitting and build a model even on small datasets.


Machine learning is now used in many different applications. People are working out how to apply it in their own domains, but they often run into a lack of data: there is not enough of it to build a reliable predictive model. When we do build a predictive model on so little data, it is often overfitted and performs poorly. What can we do in these situations? How do we build a model on a dataset that has only 100-200 rows?

In this article, we will explore ways to tackle this problem and build a model even on small datasets. We will also look at how to handle overfitting. For this experiment, we will use the Iris dataset, in which each flower has to be classified into one of three species. The dataset is publicly available on Kaggle for download.

What will we learn from this article?

  1. How to build a machine learning model over a small dataset?
  2. What is Overfitting and how to overcome it? What are the different ways?

So let us begin our experiment.

  1. How to Build a Machine Learning Model over a Small Dataset?

Let us first import the required libraries, load the data and explore the dataset. Use the code below.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

# Load the dataset (iris.csv is expected in the working directory)
df = pd.read_csv('iris.csv')
print(df.shape)  # (number of rows, number of columns)

print(df)

Now we will encode the class column with a label encoder and visualize a pair plot using the seaborn library. Use the code below.

le = LabelEncoder()

df['Class'] = le.fit_transform(df['Class'])

sns.pairplot(df)

The pair plot helps us understand the relationship between every pair of columns, including the target. It also gives an idea of which features are strong predictors of the target. Now we will separate the data into the independent features X and the dependent feature y. After that, we will split the dataset into training and testing sets. Refer to the code below.

X = df.drop('Class', axis = 1)
y = df['Class']
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

This leaves 100 rows for training and 50 rows for testing. We will now define two different models, logistic regression and k-nearest neighbors (KNeighborsClassifier), and fit them on the training data. Use the code below.

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Define both models with their default settings
lr = LogisticRegression()
knn = KNeighborsClassifier()

# Fit on the training split and report the training accuracy
lr.fit(X_train,y_train)
knn.fit(X_train,y_train)
print("Training Accuracy of KNN: ", knn.score(X_train,y_train))
print("Training Accuracy of LogisticRegression: ",lr.score(X_train,y_train))

We have now trained the two models. Next, we compute predictions on the 50 rows of testing data. Refer to the code below.

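# Predict on the held-out testing rows
y_lr = lr.predict(X_test)
y_knn = knn.predict(X_test)

# accuracy_score expects the true labels first, then the predictions
print("Testing Accuracy KNN : ", accuracy_score(y_test,y_knn))
print("Testing Accuracy LogisticRegression: ",accuracy_score(y_test,y_lr))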

Comparing the training and testing accuracy, we can see that the models are overfitted: they score higher on the training data than on the testing data, which means they may not perform well when we expose them to production data.
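The comparison described above can be made explicit by printing the gap between training and testing accuracy for each model. The following is only a minimal sketch: it assumes scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for `iris.csv`, `accuracy_gap` is a hypothetical helper name, and `max_iter=1000` is an assumed setting chosen so the solver has room to converge.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled Iris data stands in for iris.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)


def accuracy_gap(model):
    """Hypothetical helper: fit the model, return (train acc, test acc, gap)."""
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    return train_acc, test_acc, train_acc - test_acc


models = {
    "KNN": KNeighborsClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# A clearly positive gap (training well above testing) hints at overfitting.
results = {name: accuracy_gap(model) for name, model in models.items()}
for name, (train_acc, test_acc, gap) in results.items():
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gap:+.3f}")
```

With only 50 testing rows, a single misclassified flower moves the testing accuracy by 0.02, so small gaps should be read with caution.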

Now we will see how we can overcome this situation. 

  2. What is Overfitting and how to overcome it? What are the different ways?

Overfitted models are models that perform very well in training but not so well in testing. Several approaches can help with this. We can use regularization techniques such as Ridge (L2) and Lasso (L1), which penalize large coefficients to prevent overfitting. Read the article “Hands-On Implementation of Lasso and Ridge Regression” to learn how to use regularization. We can also look for outliers in the data and remove them. Let us now check the descriptive statistics and some box plot visualizations of the dataset. Use the code below.

print(df.describe())

# Recent seaborn versions require x and y as keyword arguments;
# plt.show() draws each box plot on its own figure
for col in ['Sepal Length (in cm)', 'Sepal Width in (cm)', 'Petal length (in cm)', 'Petal width (in cm)']:
    sns.boxplot(x=df['Class'], y=df[col])
    plt.show()

The descriptive statistics report the count, mean, standard deviation, minimum, quartiles (including the median) and maximum of each column, whereas the box plots show the distribution of each feature and reveal any outliers. There are not many extreme values in this dataset. If we do find many outliers, we can treat them. See my article “Outlier Detection Using Z score”, where I explain outlier treatment.
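As a rough illustration of the z-score idea mentioned above, the sketch below flags values that lie more than three standard deviations from their column mean. It assumes scikit-learn's bundled Iris data in place of `iris.csv`, and the threshold of 3 is a common convention rather than a value taken from this article.

```python
from sklearn.datasets import load_iris

# Assumption: the bundled Iris data stands in for iris.csv.
features = load_iris(as_frame=True).data  # DataFrame with the four measurements

# z-score: distance of each value from its column mean, in standard deviations
z = (features - features.mean()) / features.std(ddof=0)

# Assumed threshold: treat |z| > 3 as an outlier
outlier_mask = z.abs() > 3
outliers_per_column = outlier_mask.sum()
print(outliers_per_column)

# Keep only the rows in which no feature is flagged
rows_without_outliers = features[~outlier_mask.any(axis=1)]
print(features.shape, "->", rows_without_outliers.shape)
```

On a dataset this small, dropping rows is costly, so it is worth checking that a flagged value is really an error before removing it.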

In practice, tree-based algorithms often perform well in situations like this, where other models become overfitted. We will now build two new models with tree-based algorithms, Decision Tree and Random Forest, and check the results. Use the code below.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Define and fit the two tree-based models
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
dt.fit(X_train,y_train)
rf.fit(X_train,y_train)
print("Training Accuracy of Decision Tree: ", dt.score(X_train,y_train))
print("Training Accuracy of Random Forest: ",rf.score(X_train,y_train))

# Predict on the testing rows; accuracy_score takes the true labels first
y_dt = dt.predict(X_test)
y_rf = rf.predict(X_test)
print("Testing Accuracy of Decision Tree : ", accuracy_score(y_test,y_dt))
print("Testing Accuracy Random Forest: ",accuracy_score(y_test,y_rf))

Comparing the training and testing accuracy of the tree-based models, we see that they do not overfit here. This suggests that tree-based algorithms can be very useful when you have a small amount of data. Ensembles such as Random Forest reduce overfitting by averaging many trees, tree-based models are relatively robust to outliers, and some implementations can also handle missing values. Check the article titled “Practical Guide to Machine Learning Model Evaluation and Error Metrics”, which will help you evaluate a machine learning model using different error metrics.
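Regularization, mentioned earlier as another remedy for overfitting, can be tried on the same split for comparison. The sketch below is illustrative only: it assumes scikit-learn's bundled Iris data in place of `iris.csv`, and the grid of `C` values is an arbitrary choice. In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength and the default penalty is L2 (ridge-style), so a smaller `C` means a stronger penalty.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for iris.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

scores = {}
for C in (0.01, 0.1, 1.0, 10.0):  # arbitrary illustrative grid
    # Smaller C = stronger L2 penalty on the coefficients
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    scores[C] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"C={C}: train={scores[C][0]:.3f} test={scores[C][1]:.3f}")
```

If the testing accuracy holds up while the training accuracy drops as `C` shrinks, the penalty is narrowing the gap; if both fall, the model is being constrained too much.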

Conclusion

In this article, we discussed how to build machine learning models on small datasets. We first tried logistic regression and KNN and noticed the overfitting problem. Then we discussed ways to address it, such as regularization and outlier treatment. Finding the features that are strong predictors and keeping only those can also lead to a good machine learning model. Finally, we built two more models using tree-based algorithms, compared their performance, and found that they are really helpful when we are dealing with a small amount of data.
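One possible way to find the strong predictors mentioned above is to read the feature importances of a fitted random forest. This is a sketch under assumptions: it uses scikit-learn's bundled Iris data in place of `iris.csv` (so the feature names differ from the CSV column names), and `n_estimators=100` and `random_state=42` are illustrative settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for iris.csv.
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# feature_importances_ sums to 1; larger values mark stronger predictors
ranked = sorted(
    zip(data.feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Features at the bottom of the ranking are candidates for removal, though any such decision is best checked against the testing accuracy.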

Also check this article “How Does Data Size Impact the Model Accuracy”. 


Rohit Dwivedi

I am currently enrolled in a Post Graduate Program in Artificial Intelligence and Machine Learning. I am a data science enthusiast who likes to draw insights from data, and I am always amazed by the intelligence of AI. It is really fascinating to teach a machine to see and understand images, and the interest doubles when the machine can tell you what it just saw. That is why I am highly interested in Computer Vision and Natural Language Processing. I love exploring different use cases that can be built with the power of AI. I am the kind of person who first develops something and then explains it to the whole community through my writing.