# How Does The Data Size Impact Model Accuracy?

Through this article, we will experiment with a classification model trained on datasets of different sizes. We will build a model first with a larger number of data samples, then with fewer, and compare their accuracy scores.

In machine learning, while building predictive models, we often face situations where we have little data. What should we do in such scenarios? Do we need a stronger predictive model, or more data? It is often said that more data always results in better model performance, but is that correct?

For this experiment, we will use the Wine Dataset that is available on Kaggle.

### What we will learn from this?

• How does the size of the data impact the accuracy of a classification model?
• Comparing model accuracy with fewer and more data samples

### Model 1: With Bigger Data Size

We will first build a classification model over the wine dataset, where we need to classify the quality of the wine. We have 11 independent features for predicting the target, the quality of the wine. We will now import the required libraries and read the dataset. Use the code below for the same.


```
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Read the dataset and check its dimensions
df = pd.read_csv('wine.csv')
print(df.shape)
```

Now we will separate the independent features and the target into X and y respectively. We will then split the dataset into training and testing sets. After splitting, we will fit the training data to the model and make predictions on the testing data. Use the code below for the same.

```
# Separate the independent features and the target
X = df.drop('quality', axis=1)
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Fit a logistic regression model on the training data
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
# Obtain the accuracy on the testing data
accuracy_score(y_test, y_pred)
```

### Model 2: Using Lesser Data Size

We will now build the same model using only 200 rows of the dataset and check its accuracy on the testing data. We will first read the dataset, sample 200 rows, and split the data into the independent variables X and the dependent variable y. Then we will split the dataset into training and testing sets. Use the code below for the same.

```
df = pd.read_csv('wine.csv')
# Draw a random sample of 200 rows (random_state added for reproducibility)
df = df.sample(200, random_state=42)
X = df.drop('quality', axis=1)
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Fit a fresh logistic regression model on the smaller training set
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
# Obtain the accuracy on the testing data
accuracy_score(y_test, y_pred)
```

### Comparison of Classification Accuracy for both the Models

When we trained the model over the whole dataset, we got an accuracy of 54%, whereas when we trained the same model with only 200 rows, it gave an accuracy of 62%. From this we cannot say flatly whether more data or a better model is what helps. It is often said that more data leads to better predictions, but this is not always true: if you have enough important features that are strong predictors of the target, you can get good performance even with fewer data samples. Conversely, if extra data does not help in predicting the target, it is of little use. This is why techniques like feature engineering and dimensionality reduction are used to keep only the predictors that are helpful.
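The comparison above can be made more systematic by scoring the same classifier at several training-set sizes against one fixed test set. The following is a minimal sketch of that idea; it uses scikit-learn's bundled wine data as a stand-in for the Kaggle CSV, so the exact numbers will differ from those reported above.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
# One fixed test set, so every model is scored on identical data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

scores = {}
for n in (40, 80, len(X_train)):
    # Scale features so logistic regression converges cleanly
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_train[:n], y_train[:n])
    scores[n] = accuracy_score(y_test, clf.predict(X_test))
    print(n, 'training rows -> accuracy', round(scores[n], 3))
```

Because every subset is scored on the same held-out rows, any accuracy difference comes from the training-set size alone, not from a change in the test data.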

### Conclusion

Through this article, we ran a short experiment to check model performance on a subset of the data versus the whole dataset. We built two models and compared their accuracy. I would conclude by stating that it depends on the application we are working on. In computer vision, for image classification, the amount of data plays a very important role, whereas in tabular machine learning we cannot say more data always equals a better model. We can run into situations where fewer data samples give higher accuracy than a model built over the whole dataset. Still, it is generally better to have more data, over which feature engineering can be done so that only strong predictors are kept.
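As one concrete (and purely illustrative) way of keeping only strong predictors, univariate feature selection scores each feature against the target and retains the top k. This is a minimal sketch on scikit-learn's bundled wine data; the choice of k=5 is arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_wine(return_X_y=True)
# Score each feature against the target with an ANOVA F-test
# and keep only the 5 highest-scoring features (k=5 is an arbitrary choice)
selector = SelectKBest(f_classif, k=5)
X_reduced = selector.fit_transform(X, y)
print(X.shape, '->', X_reduced.shape)
```

The reduced matrix can then be fed to the same train/test split and classifier as before, letting you check whether a handful of strong predictors matches the accuracy of the full feature set.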
