How Does The Data Size Impact Model Accuracy?


In machine learning, we often have to build predictive models with only a small amount of data. What should we do in such scenarios? Do we need a stronger predictive model, or more data? It is often said that more data always results in better model performance. But is that correct?

In this article, we experiment with a classification model trained on datasets of different sizes. We build the model first on the full dataset and then on a small sample of it, and compare the accuracy scores. For this, we use the Wine dataset that is available on Kaggle.

What will we learn from this?

• How the size of the data impacts the accuracy of a classification model
• How model accuracy compares with fewer and with more data samples

Model 1: With Bigger Data Size

We first build a classification model on the wine dataset, where the task is to classify the quality of the wine. There are 11 independent features that are used to predict the target, the wine quality. We start by importing the required libraries and loading the dataset, as in the code below.

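```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the wine dataset (the same file that is read again for Model 2)
df = pd.read_csv('wine.csv')
print(df.shape)
```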

Next, we separate the independent features and the target into X and y respectively, and then split the dataset into training and testing sets. After splitting, we fit the model on the training data and use it to make predictions on the testing data. Use the code below for this.

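```python
# Separate the features (X) from the target (y)
X = df.drop('quality', axis=1)
y = df['quality']
# Hold out 33% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
# Obtaining accuracy (accuracy_score expects the true labels first)
accuracy_score(y_test, y_pred)
```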

Model 2: Using Smaller Data Size

We now build the same model using only 200 rows of the dataset and check its accuracy on the testing data. We first read the dataset, draw a random sample of 200 rows, and split it into independent and dependent variables X and y respectively. Then we split the sample into training and testing sets. Use the code below for this.

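```python
df = pd.read_csv('wine.csv')
# Making a random sample of 200 rows (no random_state is set,
# so every run draws a different sample)
df = df.sample(200)
X = df.drop('quality', axis=1)
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Fit a fresh model on the smaller training set
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
# Obtaining accuracy (accuracy_score expects the true labels first)
accuracy_score(y_test, y_pred)
```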
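Because the 200 rows are drawn at random, a single accuracy figure from one sample can be misleading. The following is a minimal sketch of repeating the small-sample experiment over several random samples to see how much the score moves. It assumes a synthetic dataset from `make_classification` as a stand-in for the wine data, so that it runs without `wine.csv`. The helper name `accuracy_for_sample` and the `max_iter=1000` setting are illustrative choices, not part of the original experiment.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (assumption): 1500 rows, 11 features.
X_all, y_all = make_classification(n_samples=1500, n_features=11,
                                   n_informative=5, random_state=0)
data = pd.DataFrame(X_all, columns=[f"f{i}" for i in range(11)])
data['quality'] = y_all


def accuracy_for_sample(frame, n_rows, seed):
    """Train on a random n_rows subset of frame and return the test accuracy."""
    subset = frame.sample(n_rows, random_state=seed)
    X = subset.drop('quality', axis=1)
    y = subset['quality']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Repeat the 200-row experiment with ten different random samples.
scores = [accuracy_for_sample(data, 200, seed) for seed in range(10)]
print(f"min={min(scores):.2f} max={max(scores):.2f} "
      f"mean={sum(scores) / len(scores):.2f}")
```

To try this on the real data, `data` could be replaced with the DataFrame read from `wine.csv`. A wide gap between the minimum and maximum would suggest that one 200-row run says little on its own.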

Comparison of Classification Accuracy for both the Models

As we can see, when we trained the model on the whole dataset we got an accuracy of 54%, whereas the same model trained on only 200 rows gave an accuracy of 62%. From this we cannot say in general whether more data or a better model is what helps. Keep in mind that the 200 rows are drawn at random and only about 66 of them are used for testing, so this figure can change from run to run.

It is often said that the more data we have, the better the model's predictions will be. But this is not always true. If we have enough important features that are strong predictors of the target, then even with fewer data samples we can get good performance. If we have more data but it does not help in predicting the target, then the extra data is of little use. This is why techniques like feature engineering and dimensionality reduction are used to keep only those predictors that are helpful.
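To gauge how much weight the 54% versus 62% comparison can bear, one rough check is the sampling uncertainty of an accuracy measured on a small test set. The sketch below uses the normal approximation to the binomial distribution. The function name `accuracy_interval` is hypothetical, and the test-set size of 66 rows is derived from 200 rows with `test_size=0.33`.

```python
import math


def accuracy_interval(accuracy, n_test, z=1.96):
    """Approximate 95% interval for an accuracy measured on n_test rows.

    Uses the normal approximation to the binomial distribution.
    """
    std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
    return accuracy - z * std_err, accuracy + z * std_err


# 200 sampled rows with test_size=0.33 leave about 66 rows for testing.
low, high = accuracy_interval(0.62, 66)
print(f"62% on 66 test rows: {low:.2f} to {high:.2f}")
```

Under this approximation the interval runs from roughly 0.50 to 0.74, which includes 0.54. The gap between the two models could therefore come from sampling noise alone rather than from the data size.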

Conclusion

In this article, we ran a short experiment to compare model performance on a subset of the data and on the whole dataset. We built two models and checked their accuracy. I would conclude by stating that the answer depends on the application we are working on. In computer vision, for example in image classification, the amount of data plays a very important role, whereas in machine learning we cannot say that more data always equals a better model. We can get into situations where a model built on fewer data samples gives higher accuracy than the model built on more. But it is always better to have more data, on which feature engineering can be done so that only strong predictors are kept.
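The idea of keeping only strong predictors can be sketched with scikit-learn's `SelectKBest`. This is one possible technique and was not used in the experiment above. The data here is synthetic (an assumption, so the block runs without `wine.csv`), and the choice of `k=3` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in (assumption): 200 rows, 11 features, only 3 informative.
X, y = make_classification(n_samples=200, n_features=11, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Keep the 3 features with the highest ANOVA F-score, then fit the classifier.
selector_model = make_pipeline(SelectKBest(f_classif, k=3),
                               LogisticRegression(max_iter=1000))
selector_model.fit(X_train, y_train)

# Indices of the features that survived the selection step.
kept = selector_model.named_steps['selectkbest'].get_support(indices=True)
print("kept feature indices:", kept)
print("test accuracy:", selector_model.score(X_test, y_test))
```

Putting the selector inside the pipeline means the feature scores are computed from the training split only, so the test rows do not influence which features are kept.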

I am currently enrolled in a Post Graduate Program in Artificial Intelligence and Machine Learning. I am a data science enthusiast who likes to draw insights from data, and I am always amazed by the intelligence of AI. It's really fascinating teaching a machine to see and understand images, and the interest doubles when the machine can tell you what it just saw. This is why I am highly interested in Computer Vision and Natural Language Processing. I love exploring different use cases that can be built with the power of AI. I am the person who first develops something and then explains it to the whole community through my writing.
