How To Implement ML Models With Small Datasets

In this article, we will explore ways to tackle overfitting and build a model even on small datasets.

Machine learning is now used in many different applications, and people are working out how to apply it in their own domains. A problem they often run into is a lack of data: there is not enough of it to build a reliable predictive model. When a predictive model is built on so little data, it is often overfitted and does not perform well. What can we do in these situations? How do we build a model on a dataset that has only 100-200 rows?

We will look at ways to tackle this problem and build a model even on small datasets, and we will also see how to handle overfitting. For this experiment, we will use the Iris dataset, in which each flower has to be classified into one of three species. The dataset is publicly available on Kaggle for download.

What will we learn from this article?

  1. How to build a machine learning model over a small dataset?
  2. What is overfitting, and what are the different ways to overcome it?

So let us begin our experiment.

  1. How to Build a Machine Learning Model over a Small Dataset?

Let us first import the required libraries, load the data, and explore the dataset. Use the code below.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
import seaborn as sns
df = pd.read_csv('iris.csv')


Now we will encode the class column using a label encoder and visualize the pair plot using the seaborn library. Use the code below.

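# Encode the class labels as integers
le = LabelEncoder()
df['Class'] = le.fit_transform(df['Class'])

# Pair plot of every feature against every other, coloured by class
sns.pairplot(df, hue='Class')
plt.show()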


The pair plot helps us understand the relationship between each pair of columns and between each column and the target. It also gives an idea of which features are strong predictors of the target. Next, we separate the data into the independent features X and the dependent feature y, and then divide the dataset into training and testing sets. Refer to the code below.

X = df.drop('Class', axis = 1)
y = df['Class']
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.33, random_state=42)
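
As a quick sanity check on the split above, the following sketch shows how `test_size=0.33` divides a 150-row dataset. It assumes scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for the Kaggle CSV, and the `stratify=y` argument is an optional addition, not part of the original code, that keeps the class proportions roughly equal in both subsets.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv.
X, y = load_iris(return_X_y=True)

# Same split parameters as above; stratify=y is an optional extra that
# preserves the class balance, which matters more on small datasets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

# Count how many test rows fall into each class.
test_counts = Counter(y_test.tolist())

print("Training rows:", X_train.shape[0])
print("Testing rows:", X_test.shape[0])
print("Test rows per class:", dict(test_counts))
```

With only 50 test rows, a single misclassified flower changes the test accuracy by two percentage points, so small differences between models should be read with caution.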

The split leaves 100 rows for training and 50 rows for testing. We will now define two models, logistic regression and k-nearest neighbors (KNeighborsClassifier), and fit them on the training data. Use the code below.

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Define both models with their default settings
lr = LogisticRegression()
knn = KNeighborsClassifier()

# Fit each model on the training data
lr.fit(X_train, y_train)
knn.fit(X_train, y_train)

print("Training Accuracy of KNN: ", knn.score(X_train, y_train))
print("Training Accuracy of LogisticRegression: ", lr.score(X_train, y_train))

Both models are now trained. Next, we compute predictions for the 50 rows in the testing data. Refer to the code below.

y_lr = lr.predict(X_test)
y_knn = knn.predict(X_test)

# accuracy_score expects the true labels first, then the predictions
print("Testing Accuracy of KNN: ", accuracy_score(y_test, y_knn))
print("Testing Accuracy of LogisticRegression: ", accuracy_score(y_test, y_lr))

Comparing the training and testing accuracies shows that the models are overfitted: they score higher on the training data than on the unseen testing data, which means they are unlikely to perform well when exposed to production data.
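
The train-versus-test comparison described above can be sketched as a small helper. This is a minimal sketch under stated assumptions: `accuracy_gap` is a hypothetical helper name, scikit-learn's bundled Iris data stands in for the Kaggle CSV, and `max_iter=1000` is added only to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled Iris data instead of iris.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)


def accuracy_gap(model):
    """Hypothetical helper: fit the model, return (train, test, gap)."""
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    return train_acc, test_acc, train_acc - test_acc


models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    results[name] = accuracy_gap(model)
    train_acc, test_acc, gap = results[name]
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```

A clearly positive gap is the usual sign of overfitting; a gap near zero (or negative) suggests the model generalizes about as well as it fits.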

Now we will see how we can overcome this situation. 

  2. What is Overfitting and how to overcome it? What are different ways?

Overfitted models are models that perform very well on the training data but noticeably worse on the testing data. Several approaches can be used to reduce overfitting. One is regularization, with techniques such as Ridge and Lasso. Read the article “Hands-On Implementation of Lasso and Ridge Regression” to learn how to use regularization. Another is to look for outliers in the data and remove them if they are present. Let us now check the descriptive statistics and some box plots of the dataset. Use the code below.


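# Descriptive statistics for every column
print(df.describe())

# One box plot per feature, grouped by class (recent seaborn versions
# require x and y to be passed as keyword arguments)
for col in ['Sepal Length (in cm)', 'Sepal Width in (cm)', 'Petal length (in cm)', 'Petal width (in cm)']:
    sns.boxplot(x=df['Class'], y=df[col])
    plt.show()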

The descriptive statistics report the mean, median, maximum, and minimum values for each column in the dataset, whereas the box plots show the distribution of the data and reveal any outliers. There are not many extreme values in this dataset. If a dataset does contain many outliers, we can treat them. Check the article “Outlier Detection Using Z score”, where I have explained outlier treatment.
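
The z-score approach to outlier detection mentioned above can be sketched as follows. This is an illustration rather than the referenced article's exact method: it assumes scikit-learn's bundled Iris data in place of the Kaggle CSV, and the cut-off of 3 standard deviations is a common convention chosen here as an assumption.

```python
from sklearn.datasets import load_iris

# Assumption: bundled Iris data (as a pandas DataFrame) instead of iris.csv.
X = load_iris(as_frame=True).data

# Z-score of every value: distance from the column mean in units of the
# column's standard deviation (pandas uses the sample standard deviation).
z_scores = (X - X.mean()) / X.std()

# Assumption: values more than 3 standard deviations away are outliers.
threshold = 3.0
outlier_mask = z_scores.abs() > threshold

# Drop every row that has an outlier in at least one column.
rows_with_outliers = outlier_mask.any(axis=1)
X_clean = X[~rows_with_outliers]

print("Outliers per column:")
print(outlier_mask.sum())
print("Rows before:", len(X), "Rows after:", len(X_clean))
```

On a small dataset every row is valuable, so it is worth inspecting flagged rows before dropping them; an extreme but genuine measurement is not necessarily an error.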

In practice, tree-based algorithms are often observed to perform well in situations where other models overfit. We will now build two new models using the tree-based algorithms Decision Tree and Random Forest and check the results. Use the code below.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Define both tree-based models with their default settings
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()

# Fit each model on the training data
dt.fit(X_train, y_train)
rf.fit(X_train, y_train)

print("Training Accuracy of Decision Tree: ", dt.score(X_train, y_train))
print("Training Accuracy of Random Forest: ", rf.score(X_train, y_train))

y_dt = dt.predict(X_test)
y_rf = rf.predict(X_test)

# accuracy_score expects the true labels first, then the predictions
print("Testing Accuracy of Decision Tree: ", accuracy_score(y_test, y_dt))
print("Testing Accuracy of Random Forest: ", accuracy_score(y_test, y_rf))

Comparing the training and testing accuracy of the tree-based models, we see that they do not overfit here. This suggests that tree-based algorithms can be very useful when only a small amount of data is available. In this experiment they kept overfitting in check, and they also tend to cope well with missing values and outliers. Check the article “Practical Guide to Machine Learning Model Evaluation and Error Metrics”, which will help you evaluate a machine learning model using different error metrics.
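
A fitted random forest also exposes per-feature importance scores, which ties in with the earlier point about identifying strong predictors of the target. The sketch below is illustrative only: it assumes scikit-learn's bundled Iris data (so the feature names differ from the CSV's column names) and fixes `random_state` for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data instead of iris.csv.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=42
)

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# feature_importances_ holds one impurity-based score per feature;
# the scores are normalized so that they sum to 1.
ranked = sorted(
    zip(iris.feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Features with scores close to zero are candidates for removal, which leaves fewer parameters to learn from the limited rows available.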


In this article, we discussed how to build machine learning models on small datasets. We first explored a linear model and a nearest-neighbor model and noticed the overfitting problem. We then discussed ways to address it, such as regularization and outlier treatment. Identifying the features that are strong predictors and keeping only those can also lead to a good machine learning model. Finally, we built two more models using tree-based algorithms, compared their performance, and found that they are really helpful when dealing with a small amount of data.
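
The regularization mentioned in this summary can be sketched with logistic regression, whose default penalty in scikit-learn is the Ridge-style L2 penalty. The parameter `C` is the inverse of the regularization strength, so smaller values mean stronger regularization. This is a minimal sketch, assuming scikit-learn's bundled Iris data in place of the Kaggle CSV; the particular `C` values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data instead of iris.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Assumption: these C values are arbitrary, chosen to span weak to strong
# regularization (smaller C = stronger L2 penalty).
scores = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    scores[C] = (model.score(X_train, y_train), model.score(X_test, y_test))

for C, (train_acc, test_acc) in scores.items():
    print(f"C={C}: train={train_acc:.3f} test={test_acc:.3f}")
```

Comparing the two accuracies for each `C` shows the trade-off: too little regularization can widen the gap between training and testing accuracy, while too much can lower both.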

Also check this article “How Does Data Size Impact the Model Accuracy”. 

Rohit Dwivedi
I am currently enrolled in a Post Graduate Program in Artificial Intelligence and Machine Learning. I am a data science enthusiast who likes to draw insights from data and is always amazed by the intelligence of AI. It is really fascinating to teach a machine to see and understand images, and the interest doubles when the machine can tell you what it just saw. This is why I am highly interested in Computer Vision and Natural Language Processing. I love exploring different use cases that can be built with the power of AI. I am the person who first develops something and then explains it to the whole community through my writing.
