“I can make a more accurate demand forecast than a local tea vendor himself if I get historical data,” claimed a data scientist friend. Though he said it half-jokingly, it made me wonder: given historical data, can a model predict the business of a tea vendor, who has been selling tea for over 30 years, better than the vendor himself?
Data scientists often rush into developing models without making adequate sense of the data, generating results far from reality, introducing biases, and leading to decision dilemmas. The GIGO (garbage in, garbage out) principle holds: quality data produces accurate output, and poor data produces poor output. Poor data quality is the bete noire of machine learning models. The quality requirements of machine learning models are high, and bad data can hurt twice: once in the historical data used to train the predictive model, and again in the new data the same model uses to make future decisions. In my experience, more than half of the models developed are rejected by business leaders because of this problem alone.
Data and model: What’s the connection?
“What we call data are observations of real-world phenomena. Each piece of data provides a small window into a limited aspect of reality. The collection of all these observations gives us a picture of the whole. But the picture is messy because it is composed of a thousand little pieces, and there’s always measurement noise and missing pieces,” said Alice Zheng and Amanda Casari in their data science book on feature engineering.
Historical data must meet broad and high-quality standards to train a predictive model properly. First and foremost, the data must be correct: properly labelled, de-duplicated, and so on. Second, it must be meaningful. Most data quality work focuses on only one of these requirements; to develop robust models, we must work on both simultaneously.
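The "correct" half of these requirements can be checked programmatically. A minimal sketch with pandas, using hypothetical column names (`customer_id`, `spend`, `label`), shows de-duplication and label validation, two of the basic correctness checks named above:

```python
import pandas as pd

# Hypothetical training data: column names are assumptions for illustration.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "spend": [100.0, 100.0, 250.0, 80.0],
    "label": ["churn", "churn", "stay", "unknown"],
})

# Correctness check 1: drop exact duplicate rows.
df = df.drop_duplicates()

# Correctness check 2: keep only rows carrying a valid label.
valid_labels = {"churn", "stay"}
df = df[df["label"].isin(valid_labels)].reset_index(drop=True)

print(len(df))  # rows surviving the basic correctness checks
```

The "meaningful" half cannot be automated this way; it needs domain review, which is where the exploratory work described below comes in.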
Most data today fail to meet basic standards. The causes range from data creators not understanding what is expected and inaccurate measurement equipment to overly complex processes and plain human error.
Data pre-processing techniques generally refer to the addition, deletion, or transformation of training set data. Data preparation can make or break a model’s predictive ability. Different models have different sensitivities to the type of predictors in the model; how the predictors enter the model is also important. Transformations of the data to reduce the impact of data skewness or outliers can lead to significant improvements in performance.
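As a concrete instance of such a transformation, a log transform is a common way to reduce the impact of right-skewed predictors. This is a small sketch with NumPy on made-up values; the point is that the extreme observation is pulled much closer to the rest of the data while the ordering is preserved:

```python
import numpy as np

# A hypothetical right-skewed predictor (e.g. transaction amounts),
# with one extreme value dominating the scale.
amounts = np.array([5.0, 8.0, 12.0, 20.0, 500.0])

# log1p compresses the long right tail while keeping order intact.
log_amounts = np.log1p(amounts)

# Spread between largest and smallest value, before and after.
print(amounts.max() / amounts.min())          # raw spread: 100x
print(log_amounts.max() / log_amounts.min())  # far smaller after the transform
```

Which transformation is appropriate depends on the model: tree-based models are largely insensitive to monotone transforms, while linear and distance-based models can benefit significantly.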
Self-Service Analytics: It plays an important role in enabling business users to quickly prepare their data for exploratory analysis. By letting an organisation bypass the IT bottleneck (projects can take months or years to deliver), it accelerates time-to-insight and enables better business decision making. Data sources are prepared for analysis on the fly, removing the need for complex ETL processes during data discovery.
Understanding data is more of an art than a science: Data scientists must wear different hats while understanding the data. Every time I do an exploratory analysis, I feel like Sherlock Holmes. Superficially, data is black and white, with tons of files, rows and columns. We must look at the heart of a dataset to understand it and bring some colour. The truth is hidden somewhere behind those boring numbers. I see data scientists as artists. We should have an ever-inquisitive mind and ask one simple question repeatedly: why. This process is known as Exploratory Data Analysis (EDA).
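The repeated "why" usually starts with two questions: what is missing, and what looks unusual? A minimal EDA sketch in pandas, on a small hypothetical sales table, surfaces both:

```python
import pandas as pd

# A small hypothetical sales table for quick exploration.
sales = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "cups_sold": [120, 115, None, 130, 30],
})

# Question 1: what is missing, and why?
missing = sales["cups_sold"].isna().sum()

# Question 2: which observations look unusual? Here, anything under
# half the median deserves a "why?" (an arbitrary threshold for illustration).
median = sales["cups_sold"].median()
outliers = sales[sales["cups_sold"] < 0.5 * median]

print(missing)                    # missing observations to investigate
print(outliers["day"].tolist())   # days whose sales deserve a "why?"
```

Each anomaly the sketch surfaces is a question for the business, not an answer; the Friday dip here might be a holiday, a stock-out, or a data entry error.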
Predictive Modelling: Before actually building, testing, and training the model, data scientists spend most of their time collecting various types of data and then preparing that data to make feature engineering decisions (a feature meaning a field or attribute). Feature engineering is the process of changing existing features or creating new ones to improve the model’s accuracy. The right EDA helps create meaningful features that generate sensible output. This is where business domain expertise comes in, as it entails adding new data sources, applying business rules, and reshaping or restructuring the data.
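A short sketch of what such feature engineering looks like in practice, on a hypothetical orders table (all column names are assumptions): one feature derived from a business rule, one derived by reshaping a raw field.

```python
import pandas as pd

# Hypothetical raw fields; feature engineering derives new predictors from them.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-04-01", "2023-04-02", "2023-04-03"]),
    "revenue": [200.0, 350.0, 150.0],
    "items": [4, 7, 2],
})

# New feature 1: average price per item (a business-rule-driven ratio).
orders["price_per_item"] = orders["revenue"] / orders["items"]

# New feature 2: weekend flag derived from the date (dayofweek: Mon=0 .. Sun=6).
orders["is_weekend"] = orders["order_date"].dt.dayofweek >= 5

print(orders[["price_per_item", "is_weekend"]])
```

Neither feature exists in the raw data; both encode domain knowledge (pricing behaviour, weekly demand cycles) in a form the model can use.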
However, cleaning does not detect or correct all errors, and there is no way to understand the impact on the predictive model. Further, data does not always conform to “the right data” standards. Complex problems necessitate not just more data but data that is diverse and comprehensive, leading to more quality issues.
Implementation is not easy when it comes to data quality. Consider a company that wants to boost productivity with its machine learning programme. While the data science team that created the predictive model did a good job cleaning the training data, it can still be harmed by bad data in future. Again, finding and correcting errors requires a large number of people. This, in turn, undermines the expected productivity gains. Further, as machine learning technologies spread throughout the organisation, the output of one predictive model will feed the next, the next, and so on. The risk is that a minor error at one step will cascade, causing additional errors and growing in size across the entire value chain. Such concerns must be addressed with an aggressive, well-executed quality programme.
Sensible data is a prerequisite
Good historical data can be converted into good predictors and appropriate interpretation of model output, leading to prudent business decisions. But, first, we must define the goals and determine whether we have the necessary data to support the objective. When the data falls short, the best option is to collect new data, reduce the scope, or both.
Second, schedule enough time to implement data quality fundamentals in the overall project plan. For training, this translates to four person-months of cleaning for every person-month spent building the model. We must measure quality levels, assess sources, de-duplicate, and clean training data the same way we would for any important analysis. Implementations should eliminate the root causes of error and thus minimise cleaning. Begin this work as soon as possible and at least six months before you intend to release the predictive model.
Third, while preparing the training data, keep an audit trail. Keep a copy of the original training data, the data used in training, and the steps the organisations took to get from one to the other. This is good practice (though many people ignore it), and it may assist in making the process improvements required to use the predictive model in future decisions. Further, it is critical to understand the biases and limitations of the model, and the audit trail can assist in doing so.
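An audit trail need not be elaborate. A minimal sketch, with entirely hypothetical data and step names, records each preparation step alongside row counts and a content hash of the result, so the path from original to training data can be reconstructed later:

```python
import hashlib
import json

# Hypothetical original training data, kept as-is.
raw_rows = [{"id": 1, "spend": 100}, {"id": 1, "spend": 100}, {"id": 2, "spend": 250}]

audit_log = []

def record_step(name, before, after):
    """Append one preparation step, with row counts and a content hash of the result."""
    digest = hashlib.sha256(json.dumps(after, sort_keys=True).encode()).hexdigest()
    audit_log.append({
        "step": name,
        "rows_before": len(before),
        "rows_after": len(after),
        "after_sha256": digest,
    })

# One preparation step: de-duplicate, then record what was done.
deduped = [dict(t) for t in {tuple(sorted(r.items())) for r in raw_rows}]
deduped.sort(key=lambda r: r["id"])
record_step("drop_duplicates", raw_rows, deduped)

print(json.dumps(audit_log, indent=2))
```

The hash lets anyone verify later that the archived training data is the data the recorded steps actually produced, which is exactly the kind of evidence a bias or limitations review needs.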
Fourth, assign responsibility for data quality to a specific individual (or team) as you release the model. This person should have a thorough understanding of the data, including its strengths and weaknesses, and focus on two areas. First, they should establish and enforce standards for the quality of incoming data. If the data isn’t good enough, the business must step in. Second, they should be in charge of ongoing efforts to identify and eliminate root causes of error.
Finally, seek independent, stringent quality assurance. Because independence is the watchword here, this work should be performed by an internal QA department or a qualified third party.
Even after all this, the data still might not be perfect. To investigate this possibility, team up data scientists with the most experienced businesspeople when preparing the data and training the model.
Machine learning experts run many potential data inputs through algorithms, tweak the settings, and iterate until the desired outputs are produced. While in theory analytics may play no role here, in practice a business often has far too many potential ingredients to throw into the blender all at once. Data analysts eliminate the “what ifs” from business decisions. They can not only extract and analyse information to ensure the correct path is taken, but also test which outcomes would be more beneficial to the business. In addition, they track metrics related to significant changes so that, when the decision-making process is complete, there are no costly mistakes.
The best analysts are lightning-fast coders who can quickly scan massive datasets, surfacing potential insights. Their greatest asset is their speed, closely followed by their ability to identify potentially valuable gems. Mastery of visual information presentation is also beneficial: beautiful and effective plots allow the extraction of information faster, which pays off in terms of time-to-potential insights.
Data analysts should ensure high-quality, meaningful data is generated from raw data and is ready for the model. In contrast, data scientists are more concerned with feature identification, model selection, development, and ensuring the technical superiority of the developed model. Therefore, good analysts are required for data endeavours to be successful.
The other day at a fast-food joint, I overheard the owner lamenting, “IPL has killed the market today!” From a data standpoint, that day will show a drop in sales. Our task as data scientists is to question that drop until we find out why it happened, and then include the IPL as a feature in a model that can predict sales accurately. We may not replace the decades of knowledge of a local business owner with (just) a data science model. Still, we can use the right, meaningful data to help a very large business understand the same nuances.
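Turning the owner's lament into a model input can be as simple as a binary flag. A minimal sketch, with made-up sales figures and an assumed IPL match calendar, shows how the feature lets a model learn the match-day dip instead of treating it as noise:

```python
import pandas as pd

# Hypothetical daily sales; the match calendar below is assumed for illustration.
days = pd.DataFrame({
    "date": pd.to_datetime(["2023-05-01", "2023-05-02", "2023-05-03"]),
    "sales": [1000, 400, 950],
})
ipl_match_days = {pd.Timestamp("2023-05-02")}

# Binary feature: 1 on an IPL match day, 0 otherwise.
days["ipl_match"] = days["date"].isin(ipl_match_days).astype(int)

avg_ipl = days.loc[days["ipl_match"] == 1, "sales"].mean()
avg_other = days.loc[days["ipl_match"] == 0, "sales"].mean()
print(avg_ipl, avg_other)  # match days average far below other days
```

The flag encodes exactly the knowledge the owner already carries in his head, which is the point: the feature is only meaningful because someone asked why the dip happened.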
This article is written by a member of the AIM Leaders Council, an invitation-only forum of senior executives in the Data Science and Analytics industry.