Are Too Many Data Scientists Trying To Predict COVID-19 Outcomes In Futility?

Data scientists have been creating many tools that help answer significant questions around COVID-19. One example is the dashboards that track COVID-19 cases around the globe. They show active cases, people in the testing phase, information on patient history, and so on, providing a window into the overall state of the pandemic.

There have also been many challenges and hackathons in response to COVID-19, and several data companies are providing free data resources. Kaggle has thousands of posts related to COVID-19. 


The COVID-19 Open Research Dataset Challenge (CORD-19) dataset on Kaggle contains over 44,000 scholarly articles, and one Kaggle expert, Daniel Wolffram, has created several widgets that help navigate the current COVID-19 research literature. Data scientists have also built geospatial trackers of government initiatives, which serve as valuable tools during the pandemic.

Explaining The Issue

Using hundreds of metrics, data scientists have been trying to predict the course of the COVID-19 outbreak. But are the predictions accurate, given that the pandemic is a black swan event with few epidemiological records in the research literature? Even information relating to the virus's genome sequencing is new.

While data scientists are using geographical case data to predict how COVID-19 will pan out, some professionals and data scientists on social media doubt the accuracy of this work.

The issue lies in the fact that epidemiologists have been tracking and predicting the spread of pathogens for decades, long before machine learning professionals and data scientists took up the problem. In addition, data scientists may lack expertise in the highly complex biological aspects of predicting viral outbreaks.

There may also be an issue with the datasets that are being used to create predictive models. “Existing datasets (on COVID-19) are incredibly biased. For example, when calculating the mortality rate, normally we look at the deaths per confirmed case. However, the underlying assumption is that we have captured all of the confirmed cases, which is not true, since we are bottlenecked by the number of tests and only the sickest are diagnosed. For a place like New York, an exponential increase in the availability of testing can also generate an exponential growth curve,” according to Neil Cheng, Senior Data Scientist at Akamai.
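The arithmetic behind that bias can be sketched in a few lines of Python. This is a toy illustration only: every number below is hypothetical rather than drawn from real COVID-19 data, and the helper names (`naive_fatality_rate`, `confirmed_cases`) are made up for this example. It assumes a fixed pool of infections, a fixed death count, and that every available test is given to an infected person.

```python
# Toy illustration of testing-limited bias. All figures are hypothetical
# and are NOT taken from any real COVID-19 dataset.

def naive_fatality_rate(deaths: int, confirmed: int) -> float:
    """Deaths divided by confirmed cases, the ratio described in the quote."""
    return deaths / confirmed


def confirmed_cases(true_infections: int, tests_available: int) -> int:
    """Confirmed cases are capped by testing capacity.

    Simplifying assumption: every test is used on an infected person,
    so confirmed cases equal the smaller of the two quantities.
    """
    return min(true_infections, tests_available)


TRUE_INFECTIONS = 100_000  # assumed: the real number of infections stays flat
DEATHS = 1_000             # assumed: total deaths, i.e. a true rate of 1%

# Testing capacity doubles every week while the outbreak itself does not grow.
weekly_tests = [5_000 * 2 ** week for week in range(6)]
weekly_confirmed = [confirmed_cases(TRUE_INFECTIONS, t) for t in weekly_tests]
weekly_rates = [naive_fatality_rate(DEATHS, c) for c in weekly_confirmed]

for week, (cases, rate) in enumerate(zip(weekly_confirmed, weekly_rates)):
    print(f"week {week}: confirmed={cases:>7,}  naive fatality rate={rate:.2%}")
```

Under these assumptions the confirmed-case curve doubles week after week purely because testing capacity doubles, and the naive fatality rate falls from 20% towards the assumed true rate of 1% without anything changing in the underlying outbreak. A model trained on the confirmed counts alone would read both movements as features of the disease.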

Wherever There Is Data, There Is Room For Data Science

Can data scientists play a key role in predicting all aspects of the global pandemic, regardless of their experience in biology? Arguably yes, because most microbiologists and epidemiologists have had little or no training in data analysis, which is where data scientists can add value.

Indeed, there are only a handful of epidemiologists who are also good data scientists with backgrounds in mathematics, computer science, and machine learning. Here, pure data scientists can certainly collaborate with microbiologists and epidemiologists to create better predictive models. The problem arises when such models are created by pure data scientists alone, who in most cases cannot judge whether a dataset is even helpful or accurate.

Wherever there is data, there is scope for data science to make an impact. Of course, data scientists should have some level of domain knowledge so they can effectively analyse and interpret the data.

It is not merely about predicting the COVID-19 outbreak. Instead of forecasting something as complex as a global pandemic, data scientists could help create better models for optimizing hospital infrastructure, the medical supply chain, and the manufacturing of medical equipment such as ventilators and masks.

By leveraging advanced machine learning techniques, data scientists may uncover unique patterns that are valuable to domain experts. But such findings would need to be peer-reviewed, validated, and accepted by medical and epidemiological experts. Yet, as many point out, the majority of people downloading COVID-19 datasets may be unqualified to contribute in a meaningful way to saving lives, given the complexity of microbiology and epidemiology.

Vishal Chawla
Vishal Chawla is a senior tech journalist at Analytics India Magazine and writes about AI, data analytics, cybersecurity, cloud computing, and blockchain. Vishal also hosts AIM's video podcast, Simulated Reality, featuring tech leaders, AI experts, and innovative startups of India.
