11 Open Source Datasets That Can Be Used For Health Science Projects

Machine learning is now widely deployed across health sectors because it can make real-time predictions and surface insights that usually go unnoticed in voluminous, unstructured datasets. Here are a few repositories that have grown over the years, thanks to researchers' continuing efforts to make crucial data publicly available so that anyone can try it on their own models:

WHO (World Health Organisation)

WHO’s data is as authentic as it gets when it comes to keeping track of the health of all nations. Its open data source includes categories such as child nutrition, neglected diseases and risk factors pertaining to certain diseases, among others.

The data is available in Excel format.

OGD Platform India

This website hosts data collected from Indian health agencies and other entities. The categories in the catalogue range from primary health in tribal regions to state-wise health reports.

A keyword search helps you find numerous well-curated resources.

Kaggle: Health Analytics

The dataset consists of 26 indicators like acute illness, chronic illness, immunisation, mortality and others. These indicators, in turn, have sub-categories which cover all the attributes.

The survey was conducted in the Empowered Action Group (EAG) states of Uttarakhand, Rajasthan, Uttar Pradesh, Bihar, Jharkhand, Odisha, Chhattisgarh and Madhya Pradesh, and in Assam.

This dataset covers a population of 21 million and 4.32 million households spread across the rural and urban areas of these 9 states.

These benchmarks support a better, more holistic understanding and timely monitoring of the various determinants of the population's well-being and health, particularly Reproductive and Child Health.

Heart Disease Data Set

This database contains 76 attributes, but all published experiments use a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers. The “goal” field refers to the presence of heart disease in the patient. It is integer-valued, from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0).
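The presence-versus-absence framing described above can be sketched in a few lines of Python. This is only an illustration: the function name `binarize_goal` and the sample values are assumptions made for the example, not part of the dataset or its documentation.

```python
def binarize_goal(goal):
    """Map the 0-4 "goal" field to 0 (absence) or 1 (presence)."""
    if goal not in range(5):
        raise ValueError(f"goal must be an integer from 0 to 4, got {goal!r}")
    return 0 if goal == 0 else 1


# Hypothetical values for illustration only, not real patient records.
sample_goals = [0, 2, 1, 0, 4, 3]
binary_labels = [binarize_goal(g) for g in sample_goals]
print(binary_labels)  # [0, 1, 1, 0, 1, 1]
```

Collapsing values 1 to 4 into a single class turns the task into binary classification, which is the setup most experiments on this data reportedly use.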


Brainomics

Project Brainomics provides the technical foundation for this database, based on a semantic web framework, bringing together imaging, genetics and questionnaire data.


OpenfMRI

OpenfMRI.org is a project dedicated to the free and open sharing of raw magnetic resonance imaging (MRI) datasets.

Number of currently available datasets: 95

Number of subjects across all datasets: 3,372

Mental Disorders

This data was collected via the Collaborative Psychiatric Epidemiology Surveys (CPES), which were initiated in recognition of the need for contemporary, comprehensive epidemiological data regarding the distributions, correlates and risk factors of mental disorders.

The objective of the CPES was to collect data about the prevalence of mental disorders, impairments associated with these disorders, and their treatment patterns from representative samples of majority and minority adult populations in the United States.


Pima Indians Diabetes

This dataset describes the medical records for Pima Indians and whether or not each patient will have an onset of diabetes.

Field descriptions follow:

preg = Number of times pregnant

plas = Plasma glucose concentration at 2 hours in an oral glucose tolerance test

pres = Diastolic blood pressure (mm Hg)

skin = Triceps skin fold thickness (mm)

test = 2-Hour serum insulin (mu U/ml)

mass = Body mass index (weight in kg/(height in m)^2)

pedi = Diabetes pedigree function

age = Age (years)

class = Class variable (1:tested positive for diabetes, 0: tested negative for diabetes)
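As a rough sketch of how these fields might be used, the snippet below parses rows into named columns and separates the eight predictors from the class label. It assumes a comma-separated file without a header row, with columns in the order listed above. The sample rows are made up for illustration, and the helper names `load_records` and `split_features_labels` are hypothetical.

```python
import csv
import io

# Column order follows the field list above (assumed to match the file).
COLUMNS = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]

# Made-up rows for illustration only; replace with the real file's contents.
SAMPLE_CSV = """\
2,120,70,30,100,28.5,0.45,35,0
5,160,80,35,150,33.1,0.62,50,1
1,95,65,20,80,24.0,0.20,22,0
"""


def load_records(text):
    """Parse header-less CSV text into dicts keyed by the field names."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=COLUMNS)
    return [{name: float(value) for name, value in row.items()} for row in reader]


def split_features_labels(records):
    """Separate the eight predictors from the binary class label."""
    features = [[rec[name] for name in COLUMNS[:-1]] for rec in records]
    labels = [int(rec["class"]) for rec in records]
    return features, labels


records = load_records(SAMPLE_CSV)
features, labels = split_features_labels(records)
print(len(features[0]), labels)  # 8 [0, 1, 0]
```

To work with the actual data, pass the text of the downloaded file to `load_records` in place of `SAMPLE_CSV`.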

CT Medical Images

The dataset is designed to allow different methods to be tested for examining the trends in CT image data associated with the use of contrast and with patient age. The data are a tiny subset of images from The Cancer Imaging Archive (TCIA).

Malaria Datasets

A repository of segmented cells from the thin blood smear slide images from the Malaria Screener research activity.

The dataset contains a total of 27,558 cell images with equal instances of parasitised and uninfected cells.
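Since the two classes are equally represented, the per-class count follows directly from the total, and a class-balanced (stratified) split can be planned with simple integer arithmetic. The 80/20 ratio and the function name `stratified_counts` below are assumptions chosen for illustration, not something the dataset prescribes.

```python
TOTAL_IMAGES = 27_558
NUM_CLASSES = 2  # parasitised and uninfected

# Equal instances per class means the total divides evenly.
per_class, remainder = divmod(TOTAL_IMAGES, NUM_CLASSES)
assert remainder == 0


def stratified_counts(per_class_count, train_percent=80):
    """Return (train, test) image counts for one class."""
    train = per_class_count * train_percent // 100
    return train, per_class_count - train


train_per_class, test_per_class = stratified_counts(per_class)
print(per_class, train_per_class, test_per_class)  # 13779 11023 2756
```

Taking the same number of images from each class for training keeps the 50/50 balance in both the training and test sets.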

Mental Health in Tech Survey

This data was collected with the aim of measuring mental health in the tech workplace and examining the frequency of mental health disorders among tech workers.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
