An honest revelation of a data scientist

Most online data science courses and articles do an excellent job of introducing the technicalities of data science. But they also plant some well-known myths in a learner’s mind and present them as reality. It’s about time we burst these bubbles once and for all.

Time flies. It feels like only yesterday that I walked out of college as a novice statistician and into the big leagues. Yet, six years and a lot of mistakes later, I have grown from a junior analyst into a data science consultant. A lot has changed in these six years in the world of learning data science, but some misconceptions still linger.

In this article, I will address these misconceptions and try to draw a realistic picture of the world of data science.

Data Science Myths:

#1 Data science is all about building models.

If you spend enough time consuming data science content online, you will inevitably stumble upon terms like machine learning, artificial intelligence, neural networks and data modelling being thrown around. Unfortunately, the internet tends to overhype these keywords. In reality, data science requires one to thoroughly understand the data, identify its patterns, and create signals that support those patterns. Real-world data is messy and unstructured. It takes a lot of toil to bring the data up to a standard where any noticeable pattern can be found, let alone modelled. During my early days, I did almost no modelling at all; instead, I was busy sourcing, validating and cleaning data, i.e. the mundane and unattractive part of data science. Only later did I understand why these mundane things really matter and are the most crucial part of any data science solution.

#2 We have all-powerful algorithms that can do everything.

In reality, all algorithms have their own set of advantages and drawbacks. One must carefully balance these trade-offs to extract the best out of them. A deep understanding of an algorithm’s background, assumptions and workings helps to evaluate its applicability in a particular situation. Moreover, judicious tweaking of hyper-parameters in even the most basic algorithms can give better and more stable results than the default configuration of high-end algorithms. I learnt the hard way that it’s better to stick to a particular algorithm and try to extract the best out of it instead of bombarding the data with every algorithm ever known to a human being.
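As a minimal sketch of what tweaking a hyper-parameter of a basic algorithm can look like, the Python below tunes k in a hand-written k-nearest-neighbours classifier using leave-one-out validation. The toy data, the candidate values of k and all function names are assumptions made purely for illustration; they do not come from any real project.

```python
import random


def make_data(n=60, seed=0):
    """Hypothetical toy data: two noisy 1-D classes centred at 0 and 1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(label, 0.6)  # class 1 is shifted right by one unit
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes_for_one = sum(label for _, label in nearest)
    return 1 if votes_for_one * 2 > k else 0


def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += knn_predict(train, x, k) == label
    return hits / len(data)


data = make_data()
candidates = [1, 3, 5, 7, 9, 11]  # odd values avoid tied votes
scores = {k: loo_accuracy(data, k) for k in candidates}
best_k = max(scores, key=scores.get)

for k in candidates:
    print(f"k={k:>2}  leave-one-out accuracy={scores[k]:.3f}")
print("best k:", best_k)
```

The point of the sketch is the loop over candidates: the algorithm stays the same, and only one setting is varied and validated, rather than swapping in a new algorithm each time.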

#3 Data Science is a one-man army.

The online courses provide “real-like live projects” but fail to teach a key skill needed on any data science project: collaboration. Typically, a data science team will consist of: i) a Lead data scientist who provides overall guidance and manages the progress of a project, ii) a couple of Senior data scientists who work on complex pattern recognition and solution design, iii) a bunch of Junior data scientists who are still on the learning curve and iv) some Data Engineers who work on getting the data into the right format. You will need to communicate regularly with your team about what you are doing, how you are doing it, and the result. Your work will be evaluated and reviewed by the senior folks. You will have to work on a bunch of different tasks, be it data cleaning or pattern finding.

#4 There is a one-size-fits-all type of approach.

Sadly, each solution is different. The approach you work on depends on how the solution is designed, and solution designs vary widely because senior data scientists differ in their skills and in how well they understand the various data science approaches.

Even now, there are no SOPs (standard operating procedures) for how a data science project should be approached or how its intermediate processes should be handled. Even the internet is clueless about this, and every other website offers a very different approach to the same problem. The lack of SOPs, combined with the variability in how data scientists interpret a problem, makes it extremely difficult to work in a team setting. The different skill levels make the team effort disjointed, and the project’s success depends solely on the capabilities of its most experienced and talented members.

Problems Due to Non-Standard Operating Procedures:

The lack of standardization is fast becoming a major work-stopper for data science. Every solution passes through some basic steps, and you will inevitably run into issues at these steps because there is no standard procedure. Some data science professionals have devised ways to tackle them from their own experience, but not everyone has access to that knowledge. So, a lot of time is spent looking for solutions on Stack Overflow and similar platforms. Moreover, not every solution presented online is relevant, and it takes a lot of trial and error to find one that works.

An internal survey was conducted to measure the approximate time that data science practitioners in various Wipro teams allocate to the different steps of the data science workflow. The results were compiled and averaged to level out the skill differences among the practitioners, and are shown in the table below.

Sl. No. | Step       | Description                                                                         | Time (minutes)
1       | Explore    | Exploration of the best approach to the ML model                                    | 235
2       | Fit        | Testing whether the approach fits the problem                                       | 80
3       | Implement  | Implementing the best approach to the problem                                       | 80
4       | Understand | Deep understanding of the data & creating features using data wrangling techniques | 50
5       | Model      | Create and validate ML model                                                        | 50
6       | Production | Productionalize the ML model using MLOps                                            | 30
7       | Research   | Ideate new use cases, brainstorm, read about new technologies                       | 0

Table 1: Average time a data scientist allocates to each redefined step of the ML process in an 8.75-hour working day
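To make the proportions in Table 1 easier to read, the short Python sketch below recomputes the total and each step’s share of the day. The minutes are copied from the table; the variable names are only illustrative.

```python
# Average minutes per step, copied from Table 1.
minutes = {
    "Explore": 235,
    "Fit": 80,
    "Implement": 80,
    "Understand": 50,
    "Model": 50,
    "Production": 30,
    "Research": 0,
}

total = sum(minutes.values())  # 525 minutes
hours = total / 60             # 8.75 hours, matching the table caption

# Share of the working day taken by each step.
shares = {step: spent / total for step, spent in minutes.items()}

for step, spent in minutes.items():
    print(f"{step:<11}{spent:>4} min  {shares[step]:6.1%}")
print(f"Total      {total:>4} min  ({hours} hours)")
```

On these figures, “Explore” alone takes roughly 45% of the day, and “Explore”, “Fit” and “Implement” together take about three quarters of it, which leaves nothing for “Research”.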

It was observed that while the time spent on “Understand”, “Model” and “Production” remains more or less the same for every data scientist, the make-or-break moment arrives with the time spent on the “Explore” stage. This stage sets a novice apart from a master data scientist; it pinpoints the massive skill gap between data scientists and shows how that gap affects the implementation of the whole project. Apart from this, data scientists nowadays have almost no “working minutes” left for learning about new technologies and brainstorming new ideas. So, inevitably, they need to extend their working hours to compensate. Moreover, they are cutting into the time dedicated to “Understand” and “Model”, which hampers the model’s quality and stability.

Conclusion

Standardization of ML processes is the need of the hour for any data scientist. The industry has of late started to understand the perils of operating with a people-dependent approach to data science instead of a process-dependent one. A well-equipped standardized ML process will reduce the burden on data scientists and enable them to use their resources better. A standardized procedure will also help democratize the ML modelling framework and help create ML models that meet higher benchmarks.

Siladitya Sen

Siladitya Sen is a Business Analyst at Wipro Limited. He received his M.Sc. in Statistics from Presidency University, Kolkata. He has close to 7 years of experience in the field of data science and is proficient in building classical statistical models as well as machine learning and AI models.