Data is foundational to business intelligence, and training data size is one of the main determinants of a model's predictive power. It is a lever that is almost always available to pull: more data generally means more predictive power. For sophisticated models such as gradient boosted trees and random forests, quality data and careful feature engineering can reduce errors drastically.
But simply having more data is not always useful. The notion that every business needs enormous amounts of data is a myth. Large datasets do afford simple models much more power: with a trillion data points, outliers are easier to classify and the underlying distribution of the data is clearer. With ten data points, this is probably not the case, and you will have to perform more sophisticated normalisation and transformation routines before the data is useful.
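A minimal sketch of why large samples make the underlying distribution clearer, using only the standard library and made-up Gaussian data: the spread of the estimated mean shrinks as the sample grows.

```python
import random
import statistics

random.seed(0)

def spread_of_sample_means(n, trials=200):
    """Draw `trials` samples of size n from the same Gaussian and
    return the standard deviation of their sample means: a proxy for
    how clearly the underlying distribution shows through."""
    means = []
    for _ in range(trials):
        sample = [random.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)

# With more data per sample, the estimated mean clusters far more
# tightly around the true value (0.0).
print(spread_of_sample_means(10))      # wide spread
print(spread_of_sample_means(10_000))  # much tighter
```

The same intuition carries over to any estimated quantity: with ten points, the estimate is dominated by sampling noise; with millions, the distribution shows through.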
The big data paradigm is the assumption that big data can substitute for conventional data collection and analysis. In other words, it is the overconfident belief that huge amounts of data are the answer to everything and that we can simply train machines to solve problems automatically. Data by itself is not a panacea, and we cannot ignore traditional analysis.
Researchers have demonstrated that massive datasets can lower estimation variance and hence improve predictive performance. More data also increases the probability that the dataset contains useful information, which is advantageous.
However, not all data is helpful. A good example is the clickstream data used by e-commerce companies, where a user's actions are monitored and analysed. Such data includes which parts of the page are clicked, keywords, cookie data, cursor positions and which web page components are visible. This is a lot of data arriving rapidly, but only a portion of it is valuable for predicting a user's characteristics and preferences; the rest is noise. When data are derived from human actions, noise rates are usually high because of the limitations imposed by behavioural tendencies. What you ideally need is a set of data points that covers the range of variation within each class you want to train the ML system on.
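One simple way to separate signal from noise in such data is a correlation filter. The sketch below uses fabricated clickstream-style columns (the feature names `dwell_time`, `cursor_x` and `cookie_hash` are invented for illustration): only the feature that actually drives the target survives the filter.

```python
import random

random.seed(1)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Simulated clickstream-style features: only 'dwell_time' actually
# drives the target; 'cursor_x' and 'cookie_hash' are pure noise.
n = 1000
dwell_time = [random.random() for _ in range(n)]
cursor_x = [random.random() for _ in range(n)]
cookie_hash = [random.random() for _ in range(n)]
target = [d + 0.1 * random.gauss(0, 1) for d in dwell_time]

features = {"dwell_time": dwell_time, "cursor_x": cursor_x,
            "cookie_hash": cookie_hash}
informative = {name for name, col in features.items()
               if abs(pearson(col, target)) > 0.3}
print(informative)  # only the genuinely predictive feature remains
```

Real pipelines use richer filters (mutual information, model-based importance), but the principle is the same: most of the incoming stream can be discarded without hurting predictive power.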
Too Much Data
Having more data certainly increases the accuracy of your model, but there comes a point where even adding infinite amounts of data cannot improve accuracy any further. This limit is set by what is called the natural noise of the data. Across different ML models, we see that each feature of the data is spread with a given variance, following some probability distribution.
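This floor can be seen directly in a toy simulation (synthetic data, with a noise level chosen arbitrarily for illustration): even the *true* model, scored on ever larger samples, cannot beat the variance of the noise itself.

```python
import random

random.seed(2)

NOISE_SD = 0.5  # the data's natural noise; an assumption of this sketch

def mse_of_true_model(n):
    """Score the *perfect* model y_hat = 2x on n noisy points.
    Even knowing the true function, the mean squared error floors
    at NOISE_SD ** 2, no matter how large n gets."""
    xs = [random.random() for _ in range(n)]
    ys = [2 * x + random.gauss(0, NOISE_SD) for x in xs]
    return sum((y - 2 * x) ** 2 for x, y in zip(xs, ys)) / n

for n in (100, 10_000, 1_000_000):
    print(n, round(mse_of_true_model(n), 3))  # hovers around 0.25
```

No amount of extra data moves the error below roughly `NOISE_SD ** 2`; that residual is the irreducible noise the text describes.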
Dipanjan Sarkar, Data Science Lead at Applied Materials explains, “The standard principle in data science is that more training data leads to better machine learning models. However what we need to remember is the ‘Garbage In Garbage Out’ principle! It is not just big data, but good (quality) data which helps us build better performing ML models. If we have a huge data repository with features which are too noisy or not having enough variation to capture critical patterns in the data, any ML models will effectively be useless regardless of the data volume.”
According to research, if a model is tuned too closely to the data, it can essentially memorise the data, causing overfitting, which in turn causes high error rates on unseen data. An overfitting model gives wrong predictions and loses focus on what is actually important: it has low bias and high variance, and more data is not going to solve the problem. By placing too much emphasis on each data point, data scientists have to deal with a lot of noise and therefore lose sight of what really matters. In that situation, adding more data points to the training set will not improve model performance.
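Memorisation is easy to demonstrate with a 1-nearest-neighbour "model" on synthetic data (all numbers below are invented for the sketch): it scores perfectly on the points it has seen and poorly on fresh ones.

```python
import random

random.seed(3)

def make_data(n, noise=0.5):
    """Noisy samples from the true relation y = 2x."""
    xs = [random.random() for _ in range(n)]
    ys = [2 * x + random.gauss(0, noise) for x in xs]
    return xs, ys

def one_nn_predict(train_xs, train_ys, x):
    """A 1-nearest-neighbour 'memoriser': low bias, high variance."""
    i = min(range(len(train_xs)), key=lambda j: abs(train_xs[j] - x))
    return train_ys[i]

def mse(xs, ys, predict):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_xs, train_ys = make_data(200)
test_xs, test_ys = make_data(200)

memoriser = lambda x: one_nn_predict(train_xs, train_ys, x)
print("train MSE:", mse(train_xs, train_ys, memoriser))  # 0.0: memorised
print("test MSE: ", mse(test_xs, test_ys, memoriser))    # large: overfit
```

The training error is exactly zero because every training point is its own nearest neighbour, while the test error is roughly double the irreducible noise: the signature of low bias and high variance.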
We need big data mostly when there are a huge number of features, as in image processing or language modelling, where ample data is needed to train a model. According to experts, finding the right parameters for such elaborate models generally requires big datasets to reach high accuracy: there are many knobs, and you have to try enough of them, in the right parts of the parameter space, to reduce the training error.
“There are no shortcuts or direct mathematical formulae to say if we have enough data. The only way would be to actually get out there and build relevant ML models on the data and validate based on performance metrics (which are in-line with the business metrics & KPIs) to see if we are getting a satisfactory performance,” Dipanjan further says.
It’s Not About The Quantity As Much As Quality Sampling
More data is, in principle, good. In practice, what matters is having the right kind of data. Sampling training data from your actual target domain always matters, and even within a domain it matters how you sample. Modelling choices and the data sampling approach jointly matter more than sheer volume. Samples must represent the real-world examples the model has a good chance of encountering in the future.
The main reason more data is desirable is that it lends more information about the underlying population and thus becomes valuable. However, if newly collected data resembles the existing data, or simply repeats it, then having more of it adds no value. For example, in an online review dataset there is not much lift from scale: you probably do not have many variables, and a few thousand user reviews already give you essentially the same sample.
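The repeated-data point can be made precise with ordinary least squares on fabricated data: duplicating the whole dataset ten times leaves the fitted line essentially unchanged, because no new information entered.

```python
import random

random.seed(5)

def ols_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [random.random() for _ in range(100)]
ys = [3 * x + 1 + random.gauss(0, 0.2) for x in xs]

once = ols_fit(xs, ys)
tenfold = ols_fit(xs * 10, ys * 10)  # same points repeated ten times
print(once)
print(tenfold)  # the fit is unchanged (up to float rounding)
```

Ten times the rows, zero times the information: the duplicated dataset scales both numerator and denominator of the slope estimate by the same factor, so the answer cannot move.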
From a pure regression standpoint, if you have a true (representative) sample, data size beyond a point does not matter. There is diminishing value in adding observations from a Mean Squared Error standpoint, MSE being a standard way to measure a model's error in predicting quantitative data.
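The diminishing return follows directly from the standard-error formula sigma / sqrt(n): each extra batch of observations buys less precision than the last. A few lines of arithmetic (with an arbitrary sigma of 1.0) make it concrete.

```python
def standard_error(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n).
    The MSE of the estimate is this quantity squared."""
    return sigma / n ** 0.5

sigma = 1.0
# Each additional 1,000 observations buys less and less precision:
gains = [standard_error(sigma, n) - standard_error(sigma, n + 1_000)
         for n in (1_000, 10_000, 100_000)]
for n, g in zip((1_000, 10_000, 100_000), gains):
    print(n, round(g, 6))
```

The same 1,000 extra rows that are transformative at n = 1,000 are almost invisible at n = 100,000.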
Previous work makes it explicit that more data does not necessarily lead to greater predictive performance. It has been argued that sampling (decreasing the number of instances) or transforming the data into a lower-dimensional space (reducing the number of features) can be beneficial. Indeed, not all areas of machine learning revolve around big data; one of the most exciting recent areas is concerned with making sense of small data.
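A hand-rolled, two-feature sketch of such a dimensionality reduction (a miniature principal component analysis on synthetic, highly correlated columns): projecting onto the principal axis of the covariance matrix keeps nearly all the variance in half the features.

```python
import math
import random

random.seed(4)

# Two highly correlated features: x2 is nearly a copy of x1, so the
# data is effectively one-dimensional.
n = 500
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [x + random.gauss(0, 0.1) for x in x1]

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)

a, b, c = cov(x1, x1), cov(x1, x2), cov(x2, x2)
# Principal axis of the 2x2 covariance matrix [[a, b], [b, c]].
theta = 0.5 * math.atan2(2 * b, a - c)
w = (math.cos(theta), math.sin(theta))

# Project both features onto a single component; variance retained:
proj = [u * w[0] + v * w[1] for u, v in zip(x1, x2)]
retained = cov(proj, proj) / (a + c)
print(f"variance retained by 1 of 2 dimensions: {retained:.1%}")
```

With redundant features, one well-chosen dimension carries almost all the information, which is exactly why shrinking the feature space can help rather than hurt.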
Reduced Data Requirements
When we think of advanced machine learning models, we assume that everything has to be learned from the data. Yet there are several use cases where a handful of data points has worked equally well, using techniques such as simulation and semi-supervised learning.
In practice, there is research on neural network architectures that do reasonably well with just a thousand data points. They are not fancy, but they are better than some machine learning methods if you have the right problem type.
“With the advent of innovative methodologies like transfer learning, unsupervised, self-supervised and semi-supervised learning, we are seeing new areas of research being actually adapted in the industry to build better quality ML models with less (labeled) data,” says Dipanjan Sarkar.
There is also extensive work on techniques that reduce data requirements. Researchers are building ways to pull in human experience and knowledge rather than trying to discover everything from the raw data itself. Organisations are focusing on hybrid machine learning systems that combine old-fashioned rule-based systems with underlying neural architectures, with a bi-directional flow of information that learns from logical statements.
Other Factors For Not Desiring Big Datasets
For smaller firms, smaller datasets can be equally desirable, or even preferable, and there are situations where more data brings expenses that are not justified by its added value. Data storage is an expense, and analysts who can work with datasets too large to fit in memory, using the appropriate tools, cost more than those who cannot.
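The "too large for memory" problem often has a cheap workaround: streaming statistics. A minimal sketch (with a simulated million-row column; the data and sizes are invented) computes a mean in one pass and constant memory, so nothing ever has to fit in RAM.

```python
def running_mean(stream):
    """One-pass mean over an iterator: constant memory, so the
    dataset never has to fit in RAM."""
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count   # incremental (Welford-style) update
    return mean

# Simulate a million-row column without materialising it as a list:
big_column = (i % 7 for i in range(1_000_000))
print(running_mean(big_column))
```

Aggregates, histograms and even some model fits can be done this way, which is one reason a team without expensive big-data tooling is not necessarily stuck.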
Often a small dataset is good enough to answer the question of interest, and there is no incentive to collect additional data given the practical time and financial burdens it would create. Hacking and privacy breaches are further risks of storing too much data, since a large store invites the efforts of malicious entities. There are also cases where a company breaches a privacy regulation in its quest to acquire a large dataset.