How Does The Data Size Impact Model Accuracy?

Rohit Dwivedi

In machine learning, when building predictive models we often find ourselves with too little data. What should we do in such cases? Do we need a stronger predictive model, or more data? It is often said that more data always leads to better model performance. But is that correct?

In this article, we will experiment with a classification model on datasets of different sizes. We will build one model on the full dataset and another on a small sample of it, and then compare their accuracy scores. For this, we will use the Wine dataset that is available on Kaggle.

What will we learn from this?

  • How the size of the data impacts the accuracy of a classification model
  • How model accuracy compares between a smaller and a larger number of data samples

Model 1: With Bigger Data Size

We will first build a classification model on the wine dataset, where the task is to classify the quality of the wine. There are 11 independent features that are used to predict the target, the quality of the wine. We first import the required libraries and load the dataset, using the code below.



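import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the dataset and print its (rows, columns) shape
df = pd.read_csv('wine.csv')
print(df.shape)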

Now we will separate the independent features and the target into X and y, respectively, and then split the dataset into training and testing sets. After splitting, we fit the model on the training data and make predictions on the testing data, using the code below.

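# Separate the features from the target
X = df.drop('quality', axis=1)
y = df['quality']
# Hold out 33% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Fit the model and predict on the test set
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
# Obtaining accuracy (accuracy_score expects the true labels first)
print(accuracy_score(y_test, y_pred))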

Model 2: Using Smaller Data Size

We will now build the same model using only 200 rows of the dataset and check its accuracy on the testing data. We first read the dataset, draw a random sample of 200 rows, and separate it into independent and dependent variables X and y, respectively. Then we split the sample into training and testing sets, using the code below.

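df = pd.read_csv('wine.csv')
# Making a random sample of 200 rows (pass random_state to make it repeatable)
df = df.sample(200)
X = df.drop('quality', axis=1)
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Refit the same model on the smaller training set
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
# Obtaining accuracy (accuracy_score expects the true labels first)
print(accuracy_score(y_test, y_pred))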
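
Because the 200 rows are drawn at random, a single run gives only one possible score. The sketch below repeats the small-sample experiment over several random samples to show how much the accuracy can move. It is a minimal illustration, not the article's original experiment: the data is a synthetic stand-in generated with make_classification (the row count, class count, and number of informative features are arbitrary assumptions), and the helper accuracy_on_sample and the max_iter=1000 setting are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for wine.csv (assumption): 11 features plus a 'quality' target.
X_all, y_all = make_classification(
    n_samples=1600, n_features=11, n_informative=5, n_redundant=2,
    n_classes=3, random_state=0,
)
df = pd.DataFrame(X_all, columns=[f"f{i}" for i in range(11)])
df["quality"] = y_all


def accuracy_on_sample(data, n_rows, seed):
    """Sample n_rows, split 67/33, fit logistic regression, return test accuracy."""
    sample = data.sample(n_rows, random_state=seed)
    X = sample.drop("quality", axis=1)
    y = sample["quality"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42
    )
    model = LogisticRegression(max_iter=1000)  # hypothetical setting, not from the article
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Twenty different 200-row samples versus one run on all rows
small_scores = [accuracy_on_sample(df, 200, seed) for seed in range(20)]
full_score = accuracy_on_sample(df, len(df), seed=0)

print(f"200 rows: min={min(small_scores):.2f}, max={max(small_scores):.2f}, "
      f"mean={np.mean(small_scores):.2f}")
print(f"all rows: {full_score:.2f}")
```

To try this on the real data, replace the synthetic DataFrame with pd.read_csv('wine.csv'); the rest of the sketch only assumes a 'quality' column.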

Comparison of Classification Accuracy for both the Models

As we can see, when we trained the model on the whole dataset we got an accuracy of 54%, whereas the same model trained on only 200 rows gave an accuracy of 62%. Note that the second score is measured on just 66 test rows, so it can shift from one random sample to the next. From this experiment alone we cannot say whether it is more data or the model that helps. It is often said that the more data we have, the better the model's predictions will be, but this is not always true. If we have enough important features that are strong predictors of the target, then even with fewer data samples we can get good performance. If we have more data but it does not help in predicting the target, then the extra data is of little use. This is why techniques like feature engineering and dimensionality reduction are used to keep only the predictors that are helpful.
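
As a rough illustration of keeping only the helpful predictors, the sketch below compares a model trained on all features with one trained on the top-scoring features only. It is a sketch under stated assumptions rather than a recipe for the wine data: the dataset is synthetic (200 rows, 11 features, 3 of them informative by construction), and the choice of SelectKBest with f_classif, k=3, and feature scaling are illustrative picks that do not come from the article.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): small dataset where only 3 of 11 features carry signal.
X, y = make_classification(
    n_samples=200, n_features=11, n_informative=3, n_redundant=0,
    n_classes=2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Model on all 11 features
all_features = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# Model on the 3 features with the highest ANOVA F-score (k=3 is an arbitrary choice)
top_features = make_pipeline(
    StandardScaler(), SelectKBest(f_classif, k=3), LogisticRegression(max_iter=1000)
)

all_features.fit(X_train, y_train)
top_features.fit(X_train, y_train)

score_all = all_features.score(X_test, y_test)
score_top = top_features.score(X_test, y_test)
kept = top_features.named_steps["selectkbest"].get_support(indices=True)

print("accuracy with all 11 features:", round(score_all, 3))
print("accuracy with top 3 features: ", round(score_top, 3))
print("indices of kept features:", kept)
```

Whether the reduced model scores higher depends on the data; the point is only that the selection step is fitted on the training split and then applied unchanged to the test split.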

Conclusion 

Through this article, we ran a short experiment to compare model performance on a subset of the data and on the whole dataset. We built two models and checked their accuracy. I would conclude by saying that the answer depends on the application we are working on. In computer vision, for example in image classification, the amount of data plays a very important role, whereas in machine learning more generally we cannot say that more data always means a better model. We can get into situations where a model built on less data gives higher accuracy than a model built on more data. But it is always better to have more data, on which feature engineering can be done so that only strong predictors are kept.
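
For readers who want to go beyond two data sizes, scikit-learn's learning_curve function scores a model at several training-set sizes with cross-validation. The sketch below is one hedged way to use it; the data is again a synthetic stand-in (an assumption, not the wine dataset), and the five sizes, cv=5, and max_iter=1000 are arbitrary illustrative settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the dataset (assumption): 11 features, 3 classes.
X, y = make_classification(
    n_samples=1600, n_features=11, n_informative=5, n_redundant=2,
    n_classes=3, random_state=0,
)

# Score the model at five training-set sizes, from 10% to 100% of each training fold
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
    shuffle=True,
    random_state=0,
)

# test_scores has one row per training size and one column per CV fold
for n, scores in zip(train_sizes, test_scores):
    print(f"{int(n):5d} training rows: "
          f"mean CV accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

A curve that flattens early suggests that extra rows add little for this model, while a curve that keeps rising suggests that more data is still helping.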

Copyright Analytics India Magazine Pvt Ltd
