
How Does The Data Size Impact Model Accuracy?

In machine learning, we often have to build predictive models with limited data. What should we do in such a scenario? Do we need a stronger predictive model, or simply more data? It is often said that more data always leads to better model performance. But is that correct?

In this article, we will experiment with a classification model trained on datasets of different sizes. We will build the same model first with a larger number of data samples and then with a smaller number, and compare their accuracy scores. For this, we will use the Wine Quality dataset that is available on Kaggle.

What will we learn from this?

  • How the size of the dataset impacts the accuracy of a classification model
  • A comparison of model accuracy with fewer and with more data samples

Model 1: With a Bigger Data Size

We will first build a classification model on the wine dataset, where the task is to classify the quality of the wine. There are 11 independent features that will be used to predict the target, the wine quality. We start by importing the required libraries and the dataset, as shown below.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the wine dataset and check its dimensions
df = pd.read_csv('wine.csv')
print(df.shape)
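
Before fitting anything, it can help to confirm the feature columns and how the quality labels are distributed. A quick check along these lines (assuming the file contains the usual quality target column) is optional but useful:

# List the columns and the class balance of the target
print(df.columns.tolist())
print(df['quality'].value_counts())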

Now we will separate the independent features and the target into X and y, respectively, and split the data into training and testing sets. After splitting, we fit the model on the training data and make predictions on the testing data. Use the code below for the same.

X = df.drop('quality', axis=1)
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fit a logistic regression classifier on the training set
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Predict on the test set and obtain the accuracy
y_pred = lr.predict(X_test)
accuracy_score(y_test, y_pred)
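
To put this score in context, it can also help to compare it against a naive baseline that always predicts the most common quality label seen in training. This is a minimal sketch, not part of the original experiment, and it reuses the y_train and y_test splits created above:

# Baseline: accuracy of always predicting the majority class from the training data
majority_class = y_train.mode()[0]
baseline_accuracy = (y_test == majority_class).mean()
print(round(baseline_accuracy, 3))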

Model 2: With a Smaller Data Size

We will now build the same model using only 200 rows of the dataset and check its accuracy on the testing data. We first read the dataset, take a random sample of 200 rows, separate the independent and dependent variables into X and y, and again split the data into training and testing sets. Use the code below for the same.

# Re-read the full dataset and take a random sample of 200 rows
df = pd.read_csv('wine.csv')
df = df.sample(200)

X = df.drop('quality', axis=1)
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Refit the same logistic regression model on the smaller sample
lr.fit(X_train, y_train)

# Predict on the test set and obtain the accuracy
y_pred = lr.predict(X_test)
accuracy_score(y_test, y_pred)

Comparison of Classification Accuracy for Both Models

As we can see, when we trained the model on the whole dataset we got an accuracy of 54%, whereas the same model trained on only 200 rows gave an accuracy of 62%. From a single run like this we cannot conclude whether more data or a better model is what matters. It is often said that more data leads to better predictions, but this is not always true: if we have enough important features that are strong predictors of the target, we can get good performance even with fewer data samples. Conversely, if the additional data does not help in predicting the target, it is of little use. That is why techniques such as feature engineering and dimensionality reduction are applied, so that only the helpful predictors are kept.
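
Part of the gap between 54% and 62% may simply be sampling noise, since a single 200-row random sample can be lucky or unlucky. One way to make the comparison more reliable is to repeat it over several sample sizes and average the scores. The sketch below illustrates that idea; the sample sizes, the ten repeats per size and the raised max_iter are illustrative choices, not part of the original experiment:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

full_df = pd.read_csv('wine.csv')

# Try a few sample sizes and average the test accuracy over several random draws
for n in [200, 400, 800, len(full_df)]:
    scores = []
    for seed in range(10):
        sample = full_df.sample(n, random_state=seed)
        X = sample.drop('quality', axis=1)
        y = sample['quality']
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
        model = LogisticRegression(max_iter=1000)  # max_iter raised only to help convergence
        model.fit(X_train, y_train)
        scores.append(accuracy_score(y_test, model.predict(X_test)))
    print(n, round(np.mean(scores), 3))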

Conclusion 

In this article, we ran a short experiment to compare model performance on a subset of the data and on the whole dataset. We built two different models and checked their accuracy. I would conclude by saying that it depends on the application we are working on. In computer vision, for example in image classification, the amount of data plays a very important role, whereas in classical machine learning we cannot say that more data always equals a better model. We can run into situations where a model built on fewer data samples gives higher accuracy than one built on more data. Even so, it is generally better to have more data, on which feature engineering can be done so that only the strong predictors are kept.
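
As a follow-up to the point about keeping only strong predictors, scikit-learn's SelectKBest is one simple way to rank features against the target before modelling. This is a minimal sketch; the choice of k=5 and the ANOVA F-test scorer are illustrative assumptions rather than part of the original article:

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

df = pd.read_csv('wine.csv')
X = df.drop('quality', axis=1)
y = df['quality']

# Score every feature against the quality label and keep the five highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=5)
selector.fit(X, y)
print(X.columns[selector.get_support()].tolist())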

Rohit Dwivedi
