Top Valuable Datasets for COVID-19 Researchers

Developments around COVID-19 are unfolding so quickly that maintaining reliable data assets has become urgent, and the medical research community is finding it hard to keep up. Several freely available datasets are offered to the global research community to help produce new insights as the world continues its fight against COVID-19.

Here, we look at what these data assets are and where they can be found:

Visual Dashboard Dataset

This is the data repository for the Coronavirus Visual Dashboard, managed by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). Many organizations have used it extensively to track the geographic spread of the epidemic. The dataset is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).
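As a rough illustration of how such data can be used to follow geographic spread, here is a minimal Python sketch that sums cumulative counts per country. It assumes a wide CSV layout with four location columns (Province/State, Country/Region, Lat, Long) followed by one column per date; that layout is an assumption, not something described in this article, so check the repository's own documentation. The sample rows are made-up placeholders, not real case counts, and `totals_by_country` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the assumed wide layout; these are NOT real case counts.
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
Region 1,Country A,0.0,0.0,1,3
Region 2,Country A,0.0,0.0,2,4
,Country B,0.0,0.0,0,5
"""


def totals_by_country(csv_text):
    """Sum the per-region counts into one total per country and date."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: every column after the four location fields is a date.
    date_columns = reader.fieldnames[4:]
    totals = defaultdict(lambda: dict.fromkeys(date_columns, 0))
    for row in reader:
        for date in date_columns:
            totals[row["Country/Region"]][date] += int(row[date])
    return dict(totals)


totals = totals_by_country(SAMPLE)
print(totals["Country A"])
```

Aggregating by country is only one possible view; the same loop could group by province or keep the latitude and longitude for mapping.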

Research Articles Dataset

In response to the COVID-19 pandemic, the Allen Institute for AI, the White House, and a coalition of leading research groups have developed the COVID-19 Open Research Dataset (CORD-19). CORD-19 comprises over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.

CORD-19 is currently the most comprehensive machine-readable collection of coronavirus literature available for data mining. The Allen Institute for AI produced the dataset in cooperation with Microsoft Research, Georgetown University’s Center for Security and Emerging Technology, the Chan Zuckerberg Initiative, and the National Institutes of Health, in collaboration with the White House Office of Science and Technology Policy in the US.

The World Health Organization (WHO) has also been gathering the latest scientific findings and knowledge on COVID-19 and organizing them in a database. WHO updates the database daily through searches of bibliographic databases, manual searches of the tables of contents of relevant scientific journals, and the addition of other relevant scientific articles. The entries in the database are not fixed, and new research is added daily.

Scan Images Dataset

The British Society of Thoracic Imaging (BSTI), in collaboration with Cimar UK’s Imaging Cloud Technology, has deployed an anonymized and encrypted web portal for submitting and referring to images from patients with confirmed COVID-19. From these, BSTI hopes to build an imaging database of confirmed UK patient cases for reference and teaching. The intention is to quickly disseminate clinical and diagnostic information to frontline healthcare workers in the UK.

Lan Dao, Joseph Paul Cohen and Paul Morrison from the University of Montreal have also created a database of chest X-ray and CT images from reported COVID-19 cases. The database contains images drawn from publications and has been released publicly in a GitHub repository. The researchers say the goal is to use these images to develop AI-based approaches to better predict and understand the infection.

Twitter Data

This repository contains an ongoing collection of tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2). Collection began on January 28, 2020. Emily Chen from the University of Southern California used Twitter’s search API to retrieve tweets from the preceding seven days, so the earliest tweets in the dataset date back to January 22, 2020. Twitter’s streaming API was then used to follow specific accounts and to collect real-time tweets mentioning specific keywords. To comply with Twitter’s Terms of Service, only the tweet IDs of the collected tweets are released publicly, for non-commercial research use.
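Because only tweet IDs are released, anyone working with the dataset has to look the tweets up again from those IDs. The minimal Python sketch below shows one way to prepare for that: it cleans a list of IDs and splits it into fixed-size batches. It assumes the IDs arrive one per line, and the default batch size of 100 is an assumption about per-request lookup limits that should be checked against the current API documentation. The helper names `read_tweet_ids` and `batches` are hypothetical, and the sample IDs are placeholders rather than values from the dataset.

```python
def read_tweet_ids(lines):
    """Return unique numeric tweet IDs in first-seen order, skipping blanks."""
    seen = set()
    ids = []
    for line in lines:
        tweet_id = line.strip()
        if tweet_id.isdigit() and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


def batches(items, size=100):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder IDs for illustration only; not taken from the dataset.
sample_lines = ["1001\n", "1002\n", "\n", "1001\n", "1003\n"]
ids = read_tweet_ids(sample_lines)
chunks = list(batches(ids, size=2))
print(chunks)
```

Each batch can then be passed to whichever lookup client is in use. IDs are kept as strings so that they are not altered by numeric conversion in other tools.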

Genome Sequences Data

Laboratories around the world are generating hCoV-19 genome sequences in increasing numbers and sharing them, along with related clinical and epidemiological data, through GISAID. These genome sequences are essential for developing and assessing diagnostic tests, tracking and tracing the ongoing outbreak, and identifying possible intervention options. The GISAID initiative promotes the global sharing of all influenza virus sequences, along with the clinical and epidemiological data associated with human viruses, to help researchers.

Vishal Chawla
Vishal Chawla is a senior tech journalist at Analytics India Magazine and writes about AI, data analytics, cybersecurity, cloud computing, and blockchain. Vishal also hosts AIM's video podcast called Simulated Reality- featuring tech leaders, AI experts, and innovative startups of India.
