COVID-19 developments are unfolding so quickly that maintaining reliable data assets has become increasingly urgent, and the medical research community has found it hard to keep up. Several freely available datasets are offered to the global research community to produce new insights as the world continues its fight against COVID-19.
Here, we look at what these data assets are and where they can be found:
Visual Dashboard Dataset
This is the data repository for the Coronavirus Visual Dashboard, managed by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). Many organizations have used it extensively to track the geographic spread of the epidemic. The dataset is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).
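For readers who want to work with this kind of data, here is a minimal sketch of totaling a wide-format time series by country. It assumes a layout with one row per region, four descriptive columns (`Province/State`, `Country/Region`, `Lat`, `Long`), and then one cumulative-count column per date. The sample rows, the counts, and the function name `totals_by_country` are made up for illustration and are not taken from the repository.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the assumed wide layout: one row per region,
# four descriptive columns, then one cumulative-count column per date.
SAMPLE_CSV = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
,Exampleland,10.0,20.0,1,2,4
North,Samplestan,30.0,40.0,0,3,5
South,Samplestan,31.0,41.0,2,2,6
"""

def totals_by_country(csv_text):
    """Sum the per-date counts of all regions belonging to each country."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every column after the four descriptive ones is treated as a date.
    date_columns = reader.fieldnames[4:]
    totals = defaultdict(lambda: dict.fromkeys(date_columns, 0))
    for row in reader:
        country = row["Country/Region"]
        for date in date_columns:
            totals[country][date] += int(row[date])
    return dict(totals)

country_totals = totals_by_country(SAMPLE_CSV)
print(country_totals["Samplestan"]["1/24/20"])
```

Summing by country matters because some countries are reported as several provincial rows, while others appear as a single row with an empty province field.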
Research Articles Dataset
In response to the COVID-19 pandemic, the Allen Institute for AI, the White House, and a coalition of leading research groups have developed the COVID-19 Open Research Dataset (CORD-19). CORD-19 comprises over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.
CORD-19 is currently the most comprehensive machine-readable compilation of coronavirus literature available for data mining. The Allen Institute for AI produced the dataset in cooperation with Microsoft Research, Georgetown University’s Center for Security and Emerging Technology, the Chan Zuckerberg Initiative, and the National Institutes of Health, in coordination with the White House Office of Science and Technology Policy in the US.
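As an illustration of what machine-readable means in practice, the sketch below counts how many entries in a metadata table come with full text. The column names (`title`, `has_full_text`), the sample rows, and the function name are assumptions for illustration; the actual CORD-19 metadata file may use different fields, so check its documentation before adapting this.

```python
import csv
import io

# Made-up rows; the column names are assumed, not taken from CORD-19 itself.
SAMPLE_METADATA = """title,has_full_text
Example study of coronavirus transmission,True
Hypothetical review of SARS-CoV-2 diagnostics,False
Illustrative note on related coronaviruses,True
"""

def count_full_text(csv_text):
    """Return (total articles, articles flagged as having full text)."""
    total = 0
    with_full_text = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        total += 1
        if row["has_full_text"].strip().lower() == "true":
            with_full_text += 1
    return total, with_full_text

total, with_full_text = count_full_text(SAMPLE_METADATA)
print(f"{with_full_text} of {total} sample articles include full text")
```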
The World Health Organization (WHO) has also been gathering the latest scientific findings and knowledge on COVID-19 and organizing them in a database. WHO updates the database daily by searching bibliographic databases, manually checking the tables of contents of relevant scientific journals, and adding other relevant scientific articles. The database is therefore not static: new research is added every day.
Scan Images Dataset
The British Society of Thoracic Imaging (BSTI), in collaboration with Cimar UK’s Imaging Cloud Technology (cimar.co.uk), has built and deployed an anonymized and encrypted web portal for submitting and consulting images of patients with confirmed COVID-19. From these submissions, BSTI hopes to build an imaging database of confirmed UK patient cases for reference and teaching. The intention is to quickly disseminate clinical and diagnostic information to frontline healthcare workers in the UK.
Lan Dao, Joseph Paul Cohen and Paul Morrison from the University of Montreal have also created a database of chest X-ray and CT images from reported COVID-19 cases. The database contains images taken from publications and has been released publicly in a GitHub repository. The researchers say the goal is to use these images to develop AI-based approaches to better predict and understand the infection.
Tweet IDs Dataset
This repository holds an ongoing collection of Tweet IDs associated with the novel coronavirus (SARS-CoV-2, the virus that causes COVID-19). Collection began on January 28, 2020. Emily Chen from the University of Southern California used Twitter’s search API to retrieve Tweets from the preceding seven days, so the earliest Tweets in the dataset date back to January 22, 2020. Twitter’s streaming API was then used to follow specified accounts and to collect, in real time, Tweets that mention specific keywords. To comply with Twitter’s Terms of Service, only the Tweet IDs of the collected Tweets are released publicly, for non-commercial research use.
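Because only Tweet IDs are released, anyone using the dataset has to look up the full Tweets again through Twitter's API, a step often called hydrating. The sketch below covers only the preparation for that: reading IDs from text, dropping duplicates, and grouping them into batches. The one-ID-per-line layout, the batch size of 100, and the function names are assumptions for illustration. The lookup call itself is left out because it depends on the API version and on credentials.

```python
def read_tweet_ids(text):
    """Return the unique Tweet IDs in the text, keeping first-seen order."""
    seen = set()
    ids = []
    for line in text.splitlines():
        tweet_id = line.strip()
        # Skip blank lines and anything that is not purely numeric.
        if tweet_id.isdigit() and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids

def batches(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Made-up IDs, not real Tweets.
sample = "1000000000000000001\n1000000000000000002\n\n1000000000000000001\n"
ids = read_tweet_ids(sample)
for batch in batches(ids, size=100):
    print(len(batch), "IDs ready for lookup")
```

Keeping the IDs as strings avoids any loss of precision if they are later serialized to formats that treat large integers as floating-point numbers.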
Genome Sequences Data
Through GISAID, laboratories around the world are generating and sharing a growing number of hCoV-19 genome sequences, along with associated clinical and epidemiological data. These genome sequences are essential for developing and evaluating diagnostic tests, tracking and tracing the ongoing outbreak, and identifying potential intervention options. The GISAID initiative supports the global sharing of all influenza virus sequences, together with the clinical and epidemiological data associated with human viruses, to help researchers.