Why De-identifying Data Doesn’t Ensure Privacy

Federated learning framework offered an elevated level of privacy while maintaining utility of the global model.

“There’s a mismatch between what we think happens to our health data and what actually happens to it.” 

Nigam Shah

The US Department of Health and Human Services recently proposed amendments to the health policies under the Health Insurance Portability and Accountability Act of 1996 (HIPAA). The amendments aim to change standards that impede the transition to value-based health care by discouraging “care coordination” and “case management” communications among individuals, hospitals, physicians, and other health care providers. In practice, when health data is used for machine learning or other data-driven tools, it must be “de-identified” in compliance with HIPAA. But removing names, birthdates, gender, and other identifying details does not ensure the privacy of patients’ records.

According to Nigam Shah, professor of medicine (biomedical informatics) and of biomedical data science at Stanford University, de-identified data can easily be re-identified when combined with other datasets, and the only protection against re-identification right now is the data recipient’s agreement not to attempt it.
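
To illustrate how combining datasets can undo de-identification, here is a minimal sketch of a linkage attack. It assumes that a few coarse attributes (a three-digit ZIP prefix, birth year, and sex) survive in the de-identified records and also appear in some public list. Both toy tables, the field names, and the `link_records` helper are hypothetical and do not come from Shah or from any real dataset.

```python
# Hypothetical toy data: neither table comes from the article or a real source.
deidentified_records = [
    {"zip3": "943", "birth_year": 1961, "sex": "F", "diagnosis": "type 2 diabetes"},
    {"zip3": "943", "birth_year": 1987, "sex": "M", "diagnosis": "asthma"},
    {"zip3": "940", "birth_year": 1961, "sex": "F", "diagnosis": "hypertension"},
]
public_records = [
    {"name": "Alice Example", "zip3": "943", "birth_year": 1961, "sex": "F"},
    {"name": "Bob Example", "zip3": "943", "birth_year": 1987, "sex": "M"},
    {"name": "Carol Example", "zip3": "940", "birth_year": 1990, "sex": "F"},
]

# Attributes assumed to be present in both datasets.
QUASI_IDENTIFIERS = ("zip3", "birth_year", "sex")


def link_records(deidentified, public, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs whose quasi-identifiers match exactly one person."""
    index = {}
    for person in public:
        key = tuple(person[k] for k in keys)
        index.setdefault(key, []).append(person["name"])

    matches = []
    for record in deidentified:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], record["diagnosis"]))
    return matches


reidentified = link_records(deidentified_records, public_records)
for name, diagnosis in reidentified:
    print(name, "->", diagnosis)
```

In this toy case two of the three records match exactly one named person. The third stays unlinked only because the public list has no matching entry.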

Learning from real-world health data has proven effective in multiple healthcare applications: improving quality of care, generating medical image diagnostic tools, predicting disease risk factors, and analysing genomic data for personalised medicine. But when electronic health records (EHRs) are used for public health or research purposes, the data must be de-identified. According to Shah, de-identification is done in two ways:

  • An expert issues a certificate stating that certain data is de-identified.
  • Patient-identifying information is removed, including names, ages, addresses, email addresses, URLs, dates of service, and Social Security numbers.
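
The second approach, removing identifying fields, can be sketched roughly as follows. This is only an illustration, assuming each record is a flat dictionary. The field names, the `DIRECT_IDENTIFIERS` set, and the `strip_identifiers` helper are hypothetical, and the set simply mirrors the identifiers listed above rather than the full legal list.

```python
# Hypothetical field names mirroring the identifiers named in the list above.
DIRECT_IDENTIFIERS = {
    "name", "age", "address", "email", "url", "date_of_service", "ssn",
}


def strip_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the listed identifier fields."""
    return {
        field: value
        for field, value in record.items()
        if field not in identifiers
    }


# Hypothetical patient record, for illustration only.
patient = {
    "name": "Alice Example",
    "age": 63,
    "address": "1 Example Street",
    "email": "alice@example.com",
    "date_of_service": "2020-01-15",
    "ssn": "000-00-0000",
    "sex": "F",
    "zip3": "943",
    "diagnosis": "type 2 diabetes",
}

scrubbed = strip_identifiers(patient)
print(scrubbed)
```

The scrubbed record no longer carries a name or Social Security number, but attributes such as sex and a ZIP prefix remain and can still narrow down who the patient is.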

But here’s the catch. Under HIPAA, de-identified EHRs are no longer considered protected and can be shared openly. These datasets can be (and often are) freely bought and sold by companies that can combine them with other information for various purposes. According to Shah, it is never possible to guarantee that de-identified data can’t or won’t be re-identified, because de-identification is not anonymisation. If the goal is to maintain privacy, argues Shah, we should not insist on this legally defined, imperfect procedure while pretending that it gives us privacy when it was never designed to do so.

Despite best practices for limiting access, terms-of-use agreements, and other ways to assure compliance, experts warn that de-identified data can still be re-identified through membership inference or other inference attacks. Outside the US, where the GDPR applies, the bar is stricter: data falls outside the regulation only if it is truly anonymised, which makes purely synthetic data an attractive option. According to the experts, one solution is to release high-fidelity synthetic data, which can dramatically reduce the risk of disclosure, especially relative to traditional de-identification. However, as far as efforts to improve de-identification are concerned, Shah laments that most of these methods are just distractions.

People involved in healthcare programmes, especially patients, are usually unaware of updated legislation or the intricacies of machine learning models. This allows the trade in de-identified data to flourish without legislative or regulatory intervention. But that does not mean we should aim for leak-proof anonymisation, because data scrubbed that thoroughly would end up being useless for the models.

Going forward, rigorous policies that favour participants and patients should be combined with methods such as federated learning or differential privacy to safeguard trust in the use of cutting-edge technology for healthcare. For instance, researchers at IBM recently implemented a federated learning framework on healthcare applications using real-world electronic health data of 1 million patients. They observed that the federated learning framework offered an elevated level of privacy while maintaining the utility of the global model.
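
The idea behind federated learning is that each site trains on its own records and shares only model parameters, never the raw data. Below is a minimal sketch using federated averaging on a one-parameter linear model. It is not the IBM framework: the function names, the toy sites, and the hyperparameters are all assumptions chosen for illustration.

```python
def local_update(weight, data, lr=0.01, epochs=20):
    """Run a few gradient-descent steps on one site's private (x, y) pairs."""
    for _ in range(epochs):
        # Gradient of the mean squared error for the model y = weight * x.
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


def federated_average(global_weight, sites, rounds=10):
    """Average locally trained weights, weighted by each site's sample count."""
    for _ in range(rounds):
        # Each site starts from the current global weight and trains locally;
        # only the resulting weight and the sample count leave the site.
        updates = [(local_update(global_weight, data), len(data)) for data in sites]
        total = sum(count for _, count in updates)
        global_weight = sum(w * count for w, count in updates) / total
    return global_weight


# Hypothetical sites (e.g. three hospitals) whose data all follow y = 3x.
sites = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]

model = federated_average(0.0, sites)
print(f"learned weight: {model:.4f}")
```

The central server only ever sees weights and sample counts. Shared parameters can still leak information, so this design reduces exposure of the raw records rather than guaranteeing privacy on its own.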

Instead of de-identifying EHRs used for research, Shah recommends that health record data be kept private in the same way it is kept private when used by a patient’s medical team. “If we care about privacy, we should come up with a legal solution rather than rely on an imperfect technical crutch,” suggests Shah.

In countries like India, awareness of the pitfalls of deploying AI without safeguards can go a long way. According to NITI Aayog, AI adoption for healthcare applications in India is expected to increase exponentially in the next few years. Globally, the AI-driven healthcare market is expected to register a compound annual growth rate (CAGR) of 40% through 2021. The think tank believes that advances in technology, together with interest and activity from innovators, will allow India to solve some of its long-standing challenges in providing appropriate healthcare to a large section of its population. With such a diverse population posing unique challenges, the trade-offs between innovation and policy should be weighed with greater care. The National eHealth Authority (NeHA), the Integrated Health Information Program (IHIP), and the Electronic Health Record Standards for India are a few of the initiatives designed to address these issues.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
