Data Anonymization Is Not A Foolproof Method: Here’s Why

Data anonymization is the process of stripping personally identifiable information from a dataset while retaining the parts that remain useful for analysis, so that the data can be used without compromising users’ privacy. One of its most important applications is in healthcare.

Hospitals often remove patients’ names, addresses, and other identifying details from health records before incorporating them into large datasets.


Loopholes In Data Anonymization

In a few countries, data anonymization is recognised or mandated by law. For example, US law exempts de-identified health data from HIPAA’s privacy and security requirements. In 2016, the Australian government introduced a bill to amend its Privacy Act and make it an offence to re-identify anonymized data published by Commonwealth entities.

In theory, data anonymization sounds like a great idea. However, it is not completely foolproof. Anonymized data often fails to withstand de-anonymization attacks, in which the anonymized records are linked with auxiliary information to identify the data subjects.
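A linkage attack of this kind can be sketched in a few lines of Python. This is only an illustration: the records, names, and the choice of quasi-identifiers (`zip`, `birth_year`, `sex`) are invented assumptions, not data from any real release.

```python
# Toy linkage attack. All records and names below are hypothetical.
anonymized_records = [
    {"zip": "10001", "birth_year": 1985, "sex": "M", "diagnosis": "asthma"},
    {"zip": "10001", "birth_year": 1990, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "94105", "birth_year": 1985, "sex": "M", "diagnosis": "hypertension"},
]

# Auxiliary data an attacker might hold, e.g. a public register (assumed).
auxiliary_records = [
    {"name": "Alice Example", "zip": "10001", "birth_year": 1990, "sex": "F"},
    {"name": "Bob Example", "zip": "94105", "birth_year": 1985, "sex": "M"},
    {"name": "Carol Example", "zip": "60601", "birth_year": 1972, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link_records(anonymized, auxiliary, keys=QUASI_IDENTIFIERS):
    """Return (name, record) pairs where a person matches exactly one anonymized record."""
    index = {}
    for record in anonymized:
        index.setdefault(tuple(record[k] for k in keys), []).append(record)

    matches = []
    for person in auxiliary:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # an unambiguous match re-identifies the record
            matches.append((person["name"], candidates[0]))
    return matches


reidentified = link_records(anonymized_records, auxiliary_records)
for name, record in reidentified:
    print(name, "->", record["diagnosis"])
```

No names appear in the anonymized records, yet two of the three are tied back to a named person because their combination of quasi-identifiers is unique.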

Recently, a team from the University of Erlangen-Nuremberg in Erlangen, Germany, built a deep learning-based re-identification model to understand how far publicly available X-ray data can compromise patient privacy. The model can tell whether two X-ray images come from the same person with an accuracy of 95.55 percent.

“We conclude that publicly available medical chest X-ray data is not entirely anonymous. Using a deep learning-based re-identification network enables an attacker to compare a given radiograph with public datasets and to associate accessible metadata with the image of interest. Thus, sensitive patient data is exposed to a high risk of falling into the unauthorised hands of an attacker who may disseminate the gained information against the will of the concerned patient,” the authors observed.

In 2019, a team from Imperial College London showed that anonymized data could often be reverse-engineered using machine learning. The researchers showed that de-anonymization is possible even with incomplete datasets. As many as 99.98 percent of individuals could be correctly re-identified in the considered datasets using just 15 characteristics such as age, gender, and marital status. Lead author Luc Rocher said at the time, “While there might be a lot of people who are in their thirties, male, and living in New York City, far fewer of them were also born on 5 January, are driving a red sports car, and live with two kids (both girls) and one dog.”
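The intuition in that quote, that each extra attribute shrinks the crowd a person can hide in, can be checked on synthetic data. The sketch below is not the statistical model used in the study. It simply counts how many records are unique as attributes are added, and the population and attribute names are randomly generated assumptions.

```python
import random
from collections import Counter


def fraction_unique(records, attributes):
    """Share of records whose combination of `attributes` appears exactly once."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


# Synthetic population; the attributes and their ranges are illustrative only.
rng = random.Random(0)
population = [
    {
        "gender": rng.choice(["F", "M"]),
        "marital_status": rng.choice(["single", "married", "divorced", "widowed"]),
        "age": rng.randint(18, 90),
        "birth_month": rng.randint(1, 12),
        "num_children": rng.randint(0, 4),
    }
    for _ in range(5000)
]

attribute_order = ["gender", "marital_status", "age", "birth_month", "num_children"]
uniqueness = []
for k in range(1, len(attribute_order) + 1):
    share = fraction_unique(population, attribute_order[:k])
    uniqueness.append(share)
    print(f"{k} attribute(s): {share:.1%} of records are unique")
```

The share of unique records can only stay the same or grow as attributes are added, which is why a handful of seemingly harmless fields can be enough to single someone out.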

In a 2017 study titled ‘Health Data in an Open World’, researchers successfully re-identified patients in an Australian open health dataset.

Possible Alternatives To Data Anonymization

Differential privacy: This technique shares information about a dataset publicly by describing patterns across groups while concealing personally identifiable information. It works by adding carefully calibrated random noise, so that aggregate statistics stay accurate while the value reported for any single individual is deliberately imprecise. This makes it hard to tell whether a particular person’s data is in the dataset at all. It is already being used by companies such as Apple and Uber.
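One common way to achieve this is the Laplace mechanism, sketched below for a simple counting query. This is a minimal illustration rather than how any particular company implements it. The function names, the synthetic ages, and the value of `epsilon` are all assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of records matching `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Synthetic ages, generated only for illustration.
rng = random.Random(42)
ages = [rng.randint(18, 90) for _ in range(1000)]

true_over_65 = sum(1 for age in ages if age > 65)
noisy_over_65 = private_count(ages, lambda age: age > 65, epsilon=0.5, rng=rng)

print("true count: ", true_over_65)
print("noisy count:", round(noisy_over_65, 2))
```

A smaller `epsilon` means more noise and stronger privacy. A larger one gives more accurate answers at the cost of weaker protection.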

Federated learning: Google introduced the technique in 2017. Federated learning enables researchers to train statistical models across decentralised devices or servers, each of which holds its own local dataset. This means there is no need to upload private data to the cloud or exchange it with other teams; only model updates are shared. Compared with training on centrally pooled data, it mitigates data security and privacy risks.
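The idea can be illustrated with a toy version of federated averaging, in which each client fits a one-parameter model `y = weight * x` on its own data and the server only ever sees the resulting weights. This is a simplified sketch under assumed settings (three synthetic clients, a fixed learning rate), not a production protocol.

```python
import random


def local_update(weight, data, learning_rate=0.01, epochs=20):
    """Run gradient descent on one client's private (x, y) pairs."""
    for _ in range(epochs):
        gradient = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= learning_rate * gradient
    return weight


def federated_average(global_weight, client_datasets, rounds=10):
    """Each round, clients train locally and the server averages their weights."""
    sizes = [len(data) for data in client_datasets]
    for _ in range(rounds):
        local_weights = [local_update(global_weight, data) for data in client_datasets]
        # Weighted average by dataset size; raw data never leaves the client.
        global_weight = sum(w * n for w, n in zip(local_weights, sizes)) / sum(sizes)
    return global_weight


def make_client_data(num_points, slope, rng):
    """Generate synthetic (x, y) pairs around y = slope * x (illustrative only)."""
    data = []
    for _ in range(num_points):
        x = rng.uniform(0, 5)
        data.append((x, slope * x + rng.gauss(0, 0.1)))
    return data


rng = random.Random(1)
TRUE_SLOPE = 3.0  # assumed ground truth for the synthetic data
client_datasets = [make_client_data(n, TRUE_SLOPE, rng) for n in (30, 50, 20)]

learned_weight = federated_average(0.0, client_datasets)
print(f"learned weight: {learned_weight:.3f}")
```

The server recovers a weight close to the true slope even though it never receives a single `(x, y)` pair from any client.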

Homomorphic encryption: In this technique, calculations are performed directly on encrypted data without decrypting it first. Because the data can be manipulated without ever being revealed, homomorphic encryption has enormous potential in healthcare and financial services, where individuals’ privacy matters most.
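To make this concrete, here is a toy version of the Paillier cryptosystem, a scheme that supports addition on encrypted values. It is a sketch for illustration only. The primes are tiny and hard-coded, so it offers no real security, and the function names and sample values are assumptions.

```python
import math
import random


def generate_keys(p=1009, q=1013):
    """Build a toy Paillier key pair from two small primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n, message, rng):
    """Encrypt an integer 0 <= message < n under public key n."""
    n_squared = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_squared) * pow(r, n, n_squared)) % n_squared


def decrypt(n, private_key, ciphertext):
    """Recover the plaintext integer from a ciphertext."""
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


rng = random.Random(2024)
public_n, private_key = generate_keys()

# Two hypothetical private values, e.g. amounts held by different parties.
c1 = encrypt(public_n, 120, rng)
c2 = encrypt(public_n, 75, rng)

encrypted_sum = add_encrypted(public_n, c1, c2)
decrypted_sum = decrypt(public_n, private_key, encrypted_sum)
print("decrypted sum:", decrypted_sum)
```

Whoever computes `encrypted_sum` never sees 120 or 75. Only the holder of the private key can read the total.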

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at