In machine learning, while building predictive models, we often find ourselves with too little data. What should we do in such scenarios? Do we need a stronger predictive model, or more data to build our model? It is often said that more data always leads to better model performance. But is that correct?
In this article, we will experiment with a classification model on datasets of different sizes. We will build a model on the full dataset and then on a small sample of it, and compare their accuracy scores. For this, we are going to use the Wine Dataset that is available on Kaggle.
What will we learn from this?
- How the size of the data impacts the accuracy of a classification model
- How model accuracy compares with fewer and with more data samples
Model 1: With Bigger Data Size
We will first build a classification model on the wine dataset, where the task is to classify the quality of the wine. There are 11 independent features that are used to predict the target, the quality of the wine. We start by importing the required libraries and loading the dataset. Use the code below.
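import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the wine dataset and print its size as (rows, columns)
df = pd.read_csv('wine.csv')
print(df.shape)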
Next, we separate the independent features and the target into X and y respectively, and then split the dataset into training and testing sets. After splitting, we fit the model on the training data and make predictions on the testing data. Use the code below.
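# Separate the features from the target
X = df.drop('quality', axis=1)
y = df['quality']

# Hold out 33% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fit a logistic regression model and predict on the test set
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Obtaining accuracy (accuracy_score takes the true labels first, then the predictions)
print(accuracy_score(y_test, y_pred))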
Model 2: With Smaller Data Size
We will now build the same model using only 200 rows of the dataset and check its accuracy on the testing data. We first read the dataset and draw a random sample of 200 rows, then split the data into independent and dependent variables, X and y respectively. Then we split the sample into training and testing sets. Use the code below.
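df = pd.read_csv('wine.csv')

# Making a random sample of 200 rows (a different sample on every run,
# because no random_state is passed to sample)
df = df.sample(200)

# Separate the features from the target
X = df.drop('quality', axis=1)
y = df['quality']

# Hold out 33% of the sampled rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Calling fit again retrains the same model object from scratch on the smaller data
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Obtaining accuracy (accuracy_score takes the true labels first, then the predictions)
print(accuracy_score(y_test, y_pred))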
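Because the 200-row sample is drawn at random, a single accuracy figure from it may not be representative. The sketch below is one way to see the spread: it repeats the small-sample training several times and scores every model on the same held-out test set. It is not the experiment from this article. The data is a synthetic stand-in generated with make_classification (an assumption, so that the block runs without wine.csv), and the helper subsample_accuracies, the 20 repeats, the 134 training rows (what is left of 200 rows after a 33% test split) and max_iter=1000 are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (assumption): 11 features, 3 classes.
X, y = make_classification(
    n_samples=1500, n_features=11, n_informative=6, n_classes=3, random_state=0
)

# One fixed test set, so every model is scored on exactly the same rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)


def subsample_accuracies(n_rows, n_repeats=20):
    """Train on n_rows random training rows, n_repeats times; return the test accuracies."""
    scores = []
    for seed in range(n_repeats):
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X_train), size=n_rows, replace=False)
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[idx], y_train[idx])
        scores.append(accuracy_score(y_test, model.predict(X_test)))
    return np.array(scores)


# Accuracy of models trained on small random subsets of the training data.
small_scores = subsample_accuracies(134)

# Accuracy of one model trained on all of the training data.
full_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
full_score = accuracy_score(y_test, full_model.predict(X_test))

print(f"small samples: min={small_scores.min():.3f}, "
      f"mean={small_scores.mean():.3f}, max={small_scores.max():.3f}")
print(f"full training set: {full_score:.3f}")
```

To try this on the wine data, X and y would need to be NumPy arrays (for example via to_numpy()), since the helper indexes rows by position. Comparing the range of small-sample scores with the full-data score gives a fairer picture than comparing two single runs.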
Comparison of Classification Accuracy for both the Models
As we can see, when we trained the model on the whole dataset we got an accuracy of 54%, whereas the same model trained on only 200 rows gave an accuracy of 62%. Note that the second score comes from a single random sample and a test set of only 66 rows, so it will vary from run to run. From this experiment alone we cannot conclude whether more data or a better model matters more. It is often said that the more data we have, the better the model's predictions will be. But this is not always true: if we have enough important features that are strong predictors of the target, then even with fewer data samples we can get good performance. If we have more data but it does not help in predicting the target, then the extra data is not useful. This is why techniques like feature engineering and dimensionality reduction are used to keep only the predictors that are helpful.
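To get a feel for how much an accuracy score measured on a small test set can move, here is a minimal sketch using the normal approximation to the binomial distribution. It treats each test row as an independent right-or-wrong outcome, which is an assumption, and the function name accuracy_interval is made up for this illustration. The 66 test rows follow from taking 33% of the 200-row sample.

```python
import math


def accuracy_interval(accuracy, n_test, z=1.96):
    """Approximate 95% interval for an accuracy measured on n_test test rows."""
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    return accuracy - z * std_err, accuracy + z * std_err


# 62% accuracy measured on the 66 test rows of the 200-row sample.
low, high = accuracy_interval(0.62, 66)
print(f"200-row model: 0.62, plausible range roughly {low:.2f} to {high:.2f}")
```

Under these assumptions the range is roughly 0.50 to 0.74, which includes the 54% of the full-data model, so the gap between the two scores could be down to chance.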
In this article, we ran a short experiment to compare model performance on a subset of the data and on the whole dataset. We built two different models and checked their accuracy. I would conclude by saying that the answer depends on the application we are working on. In computer vision tasks such as image classification, the amount of data plays a very important role, whereas in machine learning more generally we cannot say that more data always means a better model. We can run into situations where a model built on fewer data samples gives higher accuracy than one built on more data. But it is still better to have more data, on which feature engineering can be done so that only strong predictors are kept.