In an official press release today, the White House announced the release of the COVID-19 Open Research Dataset (CORD-19). The dataset is the result of a combined effort by researchers and leaders from the Allen Institute for AI, the Chan Zuckerberg Initiative (CZI) and Microsoft, along with top medical organisations.
The CORD-19 dataset is the most extensive machine-readable coronavirus literature collection available, with over 29,000 articles, more than 13,000 of which have full text.
How CORD-19 Was Made
Microsoft’s web-scale literature curation tools were used to curate these thousands of articles, while the Allen AI team transformed the content into machine-readable form, making the corpus ready for analysis and study.
Researchers are encouraged to submit the text-mining tools they have developed, along with any insights that answer the White House’s call to action, which can be accessed via the Kaggle platform. These tools will be openly available to researchers around the world through Kaggle.
The researchers recognised the need to share vital information across scientific and medical communities to accelerate the response to the coronavirus pandemic. The new COVID-19 Open Research Dataset is designed to help researchers worldwide access important information faster.
Because it is difficult to go through more than 20,000 articles manually to draw insights, Kaggle has decided to upload the machine-readable versions of those articles, which can be accessed by its community of 4 million data scientists.
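To illustrate what working with machine-readable articles can look like, here is a minimal Python sketch of a keyword search over parsed papers. It assumes each paper is a JSON file with a `paper_id`, a `metadata.title` and a `body_text` list of paragraphs that each carry a `text` field. That layout and the function names are assumptions made for illustration, so check them against the files you actually download.

```python
import json
from pathlib import Path
from typing import List


def extract_text(paper: dict) -> str:
    """Join the title and body paragraphs of one parsed paper into a single string."""
    # Assumed layout: {"metadata": {"title": ...}, "body_text": [{"text": ...}, ...]}
    title = paper.get("metadata", {}).get("title", "")
    paragraphs = [p.get("text", "") for p in paper.get("body_text", [])]
    return "\n".join([title, *paragraphs])


def load_papers(folder: str) -> List[dict]:
    """Parse every *.json file found under `folder` (searched recursively)."""
    files = sorted(Path(folder).rglob("*.json"))
    return [json.loads(f.read_text(encoding="utf-8")) for f in files]


def find_papers(papers: List[dict], keyword: str) -> List[str]:
    """Return the paper_id of each paper whose text mentions `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [p.get("paper_id", "") for p in papers if needle in extract_text(p).lower()]
```

With a downloaded subset unpacked into a local folder (the path is up to you), `find_papers(load_papers("path/to/subset"), "incubation")` would list the IDs of papers that mention the term. A plain substring match is only a starting point; real text mining would likely need tokenisation and ranking.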
Download all relevant material here:
- Commercial use subset (includes PMC content) — 9000 papers, 186Mb
- Non-commercial use subset (includes PMC content) — 1973 papers, 36Mb
- PMC custom license subset — 1426 papers, 19Mb
- bioRxiv/medRxiv subset (pre-prints that are not peer reviewed) — 803 papers, 13Mb
- Kaggle challenge dataset
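As a quick sanity check, the four subsets listed above can be tallied against the figure of more than 13,000 full-text articles. The short Python sketch below uses only the paper counts and sizes given in the list; the variable names are illustrative.

```python
# (paper count, size as listed in the article, given as "Mb") for each subset
subsets = {
    "Commercial use subset": (9000, 186),
    "Non-commercial use subset": (1973, 36),
    "PMC custom license subset": (1426, 19),
    "bioRxiv/medRxiv subset": (803, 13),
}

total_papers = sum(papers for papers, _ in subsets.values())
total_size = sum(size for _, size in subsets.values())

print(f"{total_papers} papers, {total_size}Mb in total")  # 13202 papers, 254Mb in total
```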
Here is a list of other resources and platforms that can help fight COVID-19 using algorithms:
Machinehack’s Challenge To Keep Track Of COVID-19
The objective of the hackathon is to forecast three COVID-19 metrics for the subsequent day (confirmed cases, recovered cases and deaths) using the historical data available on a given date.
The dataset will be updated daily at 00:00 UTC with the latest figures for the target variables. The published data is dynamic: each day's figures are added as a new column, and the values in existing rows may also change as the reported numbers for the COVID-19 outbreak are revised in various parts of the world.
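To make the task concrete, here is a minimal Python sketch of a naive next-day baseline on a table that gains one column per day. The regions, dates and counts are made up for illustration, and the extrapolation rule (last value plus the most recent daily change) is an assumed baseline, not the hackathon's own method or scoring rule.

```python
def naive_next_day(history):
    """Extrapolate one day ahead: last value plus the most recent daily change."""
    if not history:
        raise ValueError("history must contain at least one value")
    if len(history) == 1:
        return history[-1]  # no trend available, so repeat the last value
    return history[-1] + (history[-1] - history[-2])


def forecast_table(table):
    """Apply the naive rule to every row of a region -> {date: count} table."""
    forecasts = {}
    for region, by_date in table.items():
        # ISO-formatted dates sort chronologically as plain strings
        ordered = [by_date[day] for day in sorted(by_date)]
        forecasts[region] = naive_next_day(ordered)
    return forecasts


# Hypothetical cumulative confirmed cases, one column (date) per day
table = {
    "Region A": {"2020-03-14": 100, "2020-03-15": 130, "2020-03-16": 170},
    "Region B": {"2020-03-14": 20, "2020-03-15": 20, "2020-03-16": 25},
}

print(forecast_table(table))  # {'Region A': 210, 'Region B': 30}
```

Because earlier values can be revised, a forecast like this would need to be recomputed from the full table after each daily update rather than cached.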
Currently, bioRxiv has a repository of 539 articles related to COVID-19. This free online archive of unpublished preprints in the life sciences is a wonderful resource. It is operated by Cold Spring Harbor Laboratory, a not-for-profit research and educational institution.
To support urgent research to combat the ongoing outbreak of COVID-19, caused by the novel coronavirus SARS-CoV-2, the editorial teams at Nature Research have curated a collection of relevant articles. This collection includes research into the basic biology of coronavirus infection, its detection, treatment and evolution, as well as coverage of current events. Nature Research has assured readers that these articles will remain free to access for as long as the outbreak remains a public health emergency.
Database of Chest X-Ray By University Of Montreal
This GitHub repo is maintained by a doctoral student at the University of Montreal. It contains a constantly updated database of COVID-19 cases with chest X-ray or CT images.
All images and data will be released publicly in this GitHub repo.
Check here for more details.
This resource consists of the World Health Organisation's (WHO) database of publications on coronavirus disease (COVID-19).
GISAID maintains a database of influenza virus sequences and epidemiological data associated with human viruses, along with geographical and species-specific data associated with avian and other animal viruses. This data is useful for researchers studying how viruses evolve under different factors, and it can help develop accelerated approaches to prevent future pandemics.
Protein Data Bank
The Protein Data Bank (PDB) is a great repository of information on protein structures, including computationally predicted ones. Even DeepMind’s AlphaFold predictions have been uploaded to PDB. This initiative is aimed at enabling researchers to rapidly develop tests for this novel pathogen. Other labs have shared experimentally determined and computationally predicted structures of some of the viral proteins, and still others have shared epidemiological data.