It has always been said that change is inevitable, and Data Science is no exception. The field has evolved drastically since the term was coined in the 90s. Data is the core element of Data Science: without data, there is nothing to apply science to and little that can be done. This raises many questions:
- Why do we need the data?
- What kind of data is required?
- How do we get the data?
- What do we do with the data?
And the list goes on. Our minds never stop asking questions about data. That is a good sign in a Data Scientist, because only someone who understands the value of data will get the data right.
To work through this set of questions, there should be a pre-defined path or flow. This flow is called the Data Science project lifecycle. Sometimes there is a temptation to ditch this lifecycle and skip steps. As has rightly been said:
“There’s no elevator to success, you have to take the stairs”
1. Business Understanding
Business understanding plays a key role in the success of any project. We have all the technology to make our lives easy, but even with this tremendous change, the success of a project still depends on the quality of the questions asked of the dataset.
Every domain and business works with its own set of rules and goals. To acquire the correct data, we must first understand the business. Asking questions about the dataset will help narrow down which data to acquire.
2. Data Collection
It is a well-known fact that there is no Data Science without data, so data is the essential ingredient of any Data Science project. The next question is where to get it. Data can come from many sources: web server logs, online repositories, databases, social media, Excel sheets; in short, from almost anywhere. Newspapers, journals, and websites are all made up of data. If the right questions were asked in the previous step, narrowing down to the correct data sources becomes easy.
A major challenge for data professionals in the data acquisition step is understanding where the data comes from and whether it is up to date. This makes it crucial to keep track of data sources throughout the project lifecycle, as data might need to be re-acquired to run the analysis and reach conclusions.
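One lightweight way to keep track of where data came from and whether it has changed is to store a small provenance record with each pull. The sketch below is only an illustration using Python's standard library; the function name `describe_source`, the record fields, and the sample data are all assumptions made for this example, not part of any established tool.

```python
import hashlib
from datetime import datetime, timezone


def describe_source(name, origin, raw_bytes):
    """Return a small provenance record for one acquired dataset."""
    return {
        "name": name,
        "origin": origin,  # where the data came from
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),  # fingerprint of the content
        "size_bytes": len(raw_bytes),
    }


# Hypothetical example: two pulls of the same source at different times.
first_pull = describe_source("orders", "sales database export", b"id,amount\n1,10\n")
second_pull = describe_source("orders", "sales database export", b"id,amount\n1,10\n2,25\n")

# Differing fingerprints signal that the source changed between pulls,
# so the analysis may need to be re-run on the newer data.
data_changed = first_pull["sha256"] != second_pull["sha256"]
print(data_changed)
```

Comparing fingerprints rather than file names or sizes is a design choice: it detects changes even when the amount of data stays the same.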
3. Data Preparation
Data may or may not be in the required format. To perform any analytical step, the data needs to be in a certain format; in other words, it needs to be cleaned before further processing. This step is therefore also known as Data Cleaning or Data Wrangling.
Data acquired in the previous step might not reveal a clear analytical picture or patterns. To get there, the data needs to be structured and cleaned. Data is often obtained from different sources, and for analysis it needs to be combined; this is also referred to as structuring the data. Apart from this, the data might have missing values, which obstruct analysis and model building. There are various methods for treating missing and duplicate values.
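The three preparation tasks mentioned here (removing duplicates, treating missing values, and combining sources) can be sketched in a few lines of plain Python. The records, field names, and the choice of mean imputation are assumptions made for illustration; in practice a library such as pandas is commonly used for this work, and the right treatment for missing values depends on the data.

```python
# Hypothetical records from two sources describing the same customers.
crm_rows = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": None},   # missing value
    {"customer_id": 3, "age": 51},
    {"customer_id": 3, "age": 51},     # duplicate row
]
billing_rows = [
    {"customer_id": 1, "spend": 120.0},
    {"customer_id": 2, "spend": 80.0},
    {"customer_id": 3, "spend": 200.0},
]

# 1. Remove exact duplicates while keeping the original order.
seen = set()
deduped = []
for row in crm_rows:
    key = (row["customer_id"], row["age"])
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# 2. Fill missing ages with the mean of the known ages (one of several options).
known_ages = [r["age"] for r in deduped if r["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)
filled = [
    {**r, "age": r["age"] if r["age"] is not None else mean_age}
    for r in deduped
]

# 3. Combine the two sources on the shared key.
spend_by_id = {r["customer_id"]: r["spend"] for r in billing_rows}
combined = [{**r, "spend": spend_by_id.get(r["customer_id"])} for r in filled]

for row in combined:
    print(row)
```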
Exploratory Data Analysis (EDA) plays an important role at this stage, as summarizing the clean data helps identify its structure, outliers, anomalies and patterns. These insights can help in building the model. The power of EDA is captured in the quote below:
“The greatest value of a picture is when it forces us to notice what we never expected to see” – John Tukey
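As a small illustration of what summarizing clean data can look like, the sketch below computes basic summary statistics and flags outliers with the common 1.5 × IQR rule, using only Python's `statistics` module. The sample values are hypothetical, and the IQR rule is just one of several ways to define an outlier.

```python
import statistics

# Hypothetical cleaned values, e.g. monthly spend per customer.
values = [12, 15, 14, 10, 13, 16, 11, 95]

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "min": min(values),
    "max": max(values),
}

# Flag points lying more than 1.5 * IQR below the first quartile
# or above the third quartile.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]

print(summary)
print("outliers:", outliers)
```

Here the single large value pulls the mean well above the median, which is exactly the kind of signal a summary is meant to surface before modelling.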
4. Data Modelling
This stage seems to be the most interesting one for almost all data scientists. Many people call it “a stage where magic happens”. But remember, magic can happen only if you have the correct props and technique. In data science terms, data is the prop and data preparation is the technique. So before jumping to this step, make sure to spend sufficient time on the prior steps.
Feature selection is one of the first things to do in this stage. Not all features are essential for making predictions, so the dimensionality of the dataset should be reduced by keeping the features that contribute to the prediction results.
Models are selected based on the business problem. It is essential to identify what is being asked: is it a classification problem, a regression or prediction problem, a time series forecasting problem, or a clustering problem? Once the problem type is settled, the model can be implemented.
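One simple way to do this is a correlation filter: score each feature by the absolute value of its Pearson correlation with the target and keep those above a cut-off. The sketch below assumes a small numeric dataset; the feature names, the data, and the 0.5 threshold are all hypothetical, and this filter only captures linear relationships, so it is a starting point rather than a complete method.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical dataset: three candidate features and one target.
features = {
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "store_id": [7.0, 7.0, 7.0, 7.0, 7.0],  # constant, uninformative
    "noise": [3.0, 1.0, 4.0, 1.0, 2.0],
}
target = [2.1, 3.9, 6.2, 7.8, 10.1]

THRESHOLD = 0.5  # assumed cut-off; tune per project
scores = {name: abs(pearson(col, target)) for name, col in features.items()}
selected = [name for name, score in scores.items() if score >= THRESHOLD]

print(scores)
print("selected:", selected)
```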
After modelling, model performance must be measured. For a classification problem, precision, recall, and F1-score can be used. For a regression problem, R², MAPE (Mean Absolute Percentage Error), or RMSE (Root Mean Square Error) can be used. The model should be robust and not overfitted; an overfitted model will not predict future data accurately.
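The metrics named here follow directly from their standard definitions, as the sketch below shows in plain Python. The labels and predictions are hypothetical, MAPE is computed as the mean absolute percentage error and assumes no true value is zero, and in practice a library such as scikit-learn is typically used rather than hand-written functions.

```python
import math


def classification_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def regression_metrics(y_true, y_pred):
    """Return (R squared, RMSE, MAPE in percent) for a regression problem."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mape = 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n
    return r2, rmse, mape


# Hypothetical held-out labels and predictions.
precision, recall, f1 = classification_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
r2, rmse, mape = regression_metrics([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])

print(precision, recall, f1)
print(r2, rmse, mape)
```

To check for overfitting, compute these metrics on data the model did not see during training: a large gap between training and held-out scores is the usual warning sign.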
5. Interpreting Data
This is the last step of any Data Science project and also the most important. It should be executed well enough that a layman can understand the outcome of the project. The predictive power of a model lies in its ability to generalise.
Actionable insights from the model show how Data Science enables predictive and prescriptive analytics. This gives us the power to learn how to repeat a positive result or prevent a negative one.
Last but not least, the findings should be visualized. The visualization should be in line with the business questions and meaningful to the organisation and its stakeholders. The presentation should be such that it triggers action in the audience.
All of the above steps make up a complete Data Science project, but the process is iterative: steps are repeated until the methodology is fine-tuned for a specific business case. Python and R are the most used languages for Data Science. Keep in mind the words of W. Edwards Deming:
“Without data you’re just another person with an opinion”.