Time flies. It seems like only yesterday that I walked out of college as a novice statistician into the big leagues. Yet, six years and a lot of mistakes later, I can see myself having grown from a junior analyst into the role of data science consultant. A lot has changed in these six years in the world of learning data science, but some misconceptions still linger.
In this article, I will address these misconceptions and draw a realistic picture of the world of data science.
Data Science Myths:
Most online data science courses and articles do an amazing job of giving a brief understanding of the technicalities of data science. But in the process, they plant some well-known myths of data science deep in a learner's mind. It's about time we burst these bubbles once and for all.
#1 Data science is all about building models.
If you spend enough time consuming data science-related content online, you will inevitably stumble upon terms like machine learning, artificial intelligence, neural networks and data modelling being thrown around. Unfortunately, the internet tends to overhype these keywords. In reality, data science requires one to thoroughly understand the data, identify patterns, and create signals that support those patterns. Real-world data is messy and unstructured, and it takes a lot of toil to get it up to a standard where any noticeable pattern can be found, let alone modelled. During my early days, I hardly worked on any modelling at all; instead, I was invested in sourcing, validating and cleaning data, i.e. the mundane and unattractive part of data science. I then understood why this mundane work really matters and why it is the most crucial part of any data science solution.
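To make that "mundane" part concrete, here is a minimal sketch of what validating and cleaning raw data often looks like in practice. The field names and cleaning rules (strip whitespace, coerce a numeric column, drop rows that fail validation) are hypothetical, not from the article:

```python
def clean_records(raw):
    """Strip whitespace, coerce the amount field to float, drop invalid rows."""
    cleaned = []
    for row in raw:
        name = (row.get("name") or "").strip()
        try:
            # Real-world numbers often arrive as strings with thousands separators.
            amount = float(str(row.get("amount", "")).replace(",", ""))
        except ValueError:
            continue  # unparseable amount -> drop the row
        if not name or amount < 0:
            continue  # missing name or negative amount -> drop the row
        cleaned.append({"name": name, "amount": amount})
    return cleaned

raw = [
    {"name": "  Alice ", "amount": "1,200.50"},
    {"name": "", "amount": "300"},      # missing name
    {"name": "Bob", "amount": "n/a"},   # unparseable amount
    {"name": "Carol", "amount": "-5"},  # negative amount
]
print(clean_records(raw))  # only Alice's row survives
```

Each rule here is trivial on its own; the effort in real projects comes from discovering which rules the data actually needs.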
#2 We have all-powerful algorithms that can do everything.
In reality, every algorithm has its own set of advantages and drawbacks, and one must carefully balance these trade-offs to extract the best out of it. A deep understanding of an algorithm's background, assumptions and inner workings helps in evaluating its applicability to a particular situation. Moreover, judicious tweaking of hyper-parameters in even the most basic algorithms can yield results that are statistically better and more stable than those of a high-end algorithm run in its standard configuration. I learnt the hard way that it is better to stick with a particular algorithm and extract the best out of it than to bombard the data with every algorithm ever known to humankind.
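As a toy illustration of hyper-parameter tweaking in a basic algorithm (the data, grid and penalty values below are made up, not from the article), consider one-dimensional ridge regression, where a single penalty `lam` shrinks the fitted slope. A simple grid search over `lam` on a held-out set picks the best-performing value:

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form slope for y ~ w*x with an L2 penalty lam on w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def validation_error(xs, ys, w):
    """Sum of squared errors of the fitted slope on held-out data."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

# Noisy training data that slightly overestimates the true slope of ~2.
train_x, train_y = [1, 2, 3, 4], [2.5, 4.2, 6.5, 8.4]
val_x, val_y = [5, 6], [9.8, 12.1]  # held-out validation points

# Grid search: keep the penalty with the lowest held-out error.
best_lam = min([0.0, 0.5, 2.0, 10.0],
               key=lambda lam: validation_error(
                   val_x, val_y, fit_ridge_1d(train_x, train_y, lam)))
print(best_lam, fit_ridge_1d(train_x, train_y, best_lam))
```

The unpenalized fit (`lam = 0.0`) is not the winner here: a moderate penalty corrects the noise-inflated slope, which is exactly the kind of gain careful tuning buys over running an algorithm with defaults.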
#3 Data Science is a one-man army.
The online courses provide "real-life live projects" but leave out a key skill: collaborating on a data science project. Typically, a data science team consists of i) a lead data scientist, who provides overall guidance and manages the progress of the project; ii) a couple of senior data scientists, who work on complex data pattern recognition and solution design; iii) a bunch of junior data scientists, who are still on the learning curve; and iv) some data engineers, who work on getting the data into the right format. You will need to communicate regularly with your team about what you are doing, how you are doing it, and the results. Your work will be evaluated and reviewed by the senior folks, and you will work on a bunch of different tasks, be it data cleaning or pattern finding.
#4 There is a one-size-fits-all type of approach.
Sadly, each solution is different. The approach you will work with depends on how the solution is designed, and solution designs have become quite diverse because senior data scientists differ in their skills and in their understanding of data science approaches.
Even at this point, there are no SOPs for how a data science project should be approached or how the various intermediate processes should be handled. Even the internet is clueless about this; every other website suggests a very different approach to the same problem. This lack of SOPs, combined with the variability of data scientists' interpretive capabilities, makes it extremely difficult to work in a team setting. The different skill levels make the team effort disjointed, and the project's success rests solely on the capabilities of the most experienced and talented members.
Problems due to the Lack of Standard Operating Procedures:
The lack of standardization is fast becoming a major work-stopper for data science. There are some basic steps you need to traverse for every solution, and you will inevitably face issues with these steps because there is no standard procedure. Some data science professionals have invented ways to tackle them from their own experience, but not everyone has access to these. As a result, a lot of time is spent looking for solutions on StackOverflow and similar platforms. Moreover, not all the solutions presented online may be relevant, and one has to go through a lot of trial and error to find the right one.
An internal survey was conducted to measure the approximate time allocated by data science practitioners across various Wipro teams to the different steps of the data science workflow. The results were compiled and averaged to level out skill differences among the practitioners, and are shown in the table below.
| Sl. No. | Step | Description | Time (in mins) |
| --- | --- | --- | --- |
| 1 | Explore | Exploration of the best approach to the ML model | 235 |
| 2 | Fit | Testing whether the approach fits the problem | 80 |
| 3 | Implement | Implementing the best approach to the problem | 80 |
| 4 | Understand | Deep understanding of the data & creating features using data wrangling techniques | 50 |
| 5 | Model | Create and validate the ML model | 50 |
| 6 | Production | Productionalize the ML model using MLOps | 30 |
| 7 | Research | Ideate new use cases, brainstorm, read about new technologies | 0 |
Table 1: Average time allocated by a data scientist, out of an 8.75-hour working day, to the redefined steps of the ML process
It was observed that while the time spent on "Understand", "Model" and "Production" remains more or less the same for every data scientist, the make-or-break moment is the time spent in the "Explore" stage. This stage sets a novice apart from a master data scientist, and it pinpoints the massive skill gap between different data scientists and how that gap impacts the implementation of the whole project. On top of this, data scientists nowadays have almost no working minutes left for learning about new technologies and brainstorming new ideas, so they inevitably extend their working hours to compensate. They are also cutting into the time dedicated to "Understand" and "Model", hampering the quality and stability of their models.
Standardization of ML processes is the need of the hour for every data scientist. The industry has lately begun to understand the perils of a people-dependent, rather than process-dependent, approach to data science. A sufficiently equipped ML standardization framework would reduce the burden on data scientists and enable them to utilize their resources better. A standardized procedure would also help democratize the ML modelling framework and help create ML models with higher benchmarks.
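One hedged way to picture a process-dependent workflow is as a fixed, documented sequence of stages that every project traverses in the same order. The sketch below is purely illustrative: the step names echo Table 1, but the function bodies are stand-ins, not a real standardization framework.

```python
# Each stage takes the running project state and returns it enriched.
def understand(state):
    state["features"] = ["f1", "f2"]          # stand-in for data wrangling output
    return state

def explore(state):
    state["approach"] = "baseline-then-tune"  # stand-in for the chosen approach
    return state

def model(state):
    state["model"] = f"model({state['approach']})"  # stand-in for fit & validate
    return state

def production(state):
    state["deployed"] = True                  # stand-in for the MLOps hand-off
    return state

# The SOP: a fixed, ordered list of stages, the same for every project.
PIPELINE = [understand, explore, model, production]

def run_pipeline(raw_data):
    state = {"data": raw_data}
    for step in PIPELINE:
        state = step(state)  # every project traverses the same steps, in order
    return state

result = run_pipeline("raw.csv")
print(result["deployed"])  # prints True
```

The point of the structure is that the process, not any individual, carries the project: swapping a team member changes a stage's implementation, not the pipeline's shape.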