Data anonymization is the process of stripping personally identifiable information from a dataset while retaining its analytical value, so the data can be used without compromising users’ privacy. One of its most important applications is in healthcare.
Hospitals often remove patients’ names, addresses, and other identifying details from health records before incorporating them into large research datasets.
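The basic idea can be sketched in a few lines. The field names below are hypothetical, chosen only for illustration: direct identifiers are dropped while clinically relevant fields are kept.

```python
# A minimal sketch of removing direct identifiers from health records
# before they are pooled into a research dataset. Field names are
# hypothetical, chosen for illustration.

DIRECT_IDENTIFIERS = {"name", "address", "phone", "email"}

def strip_identifiers(record: dict) -> dict:
    """Return a copy of the record with direct identifiers removed."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

record = {
    "name": "Jane Doe",
    "address": "12 Example St",
    "diagnosis": "hypertension",
    "age": 54,
}
print(strip_identifiers(record))  # keeps only diagnosis and age
```

As the rest of this article argues, this kind of field removal alone is rarely enough: the remaining attributes can still act as quasi-identifiers.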
Loopholes In Data Anonymization
In a few countries, data anonymization is addressed directly by law. For example, the US exempts properly de-identified health data from HIPAA’s privacy and security requirements. In 2016, the Australian government moved to amend its Privacy Act to criminalise the re-identification of anonymized data released by Commonwealth entities.
In theory, data anonymization sounds like a great idea. In practice, however, it is far from foolproof. Anonymized datasets frequently fail to withstand de-anonymization attacks, in which the anonymized data is linked back to auxiliary information to identify the data subjects.
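The linkage step at the heart of such attacks is straightforward. The toy example below, with entirely fabricated records, joins an “anonymized” medical table against a hypothetical public voter roll on quasi-identifiers (ZIP code, birth year, gender) to recover names:

```python
# A toy linkage attack: an "anonymized" medical table still carries
# quasi-identifiers. Joining it against a public auxiliary dataset on
# those fields re-identifies patients. All records are fabricated.

anonymized = [
    {"zip": "13053", "birth_year": 1965, "gender": "F", "diagnosis": "asthma"},
    {"zip": "13068", "birth_year": 1972, "gender": "M", "diagnosis": "diabetes"},
]

# Public auxiliary data, e.g. a voter roll
voter_roll = [
    {"name": "Alice Smith", "zip": "13053", "birth_year": 1965, "gender": "F"},
    {"name": "Bob Jones", "zip": "13068", "birth_year": 1972, "gender": "M"},
]

def link(anon_rows, aux_rows, keys=("zip", "birth_year", "gender")):
    # Index the auxiliary data by the quasi-identifier tuple, then
    # attach the matching name to each anonymized row.
    index = {tuple(r[k] for k in keys): r["name"] for r in aux_rows}
    return [{"name": index.get(tuple(r[k] for k in keys)), **r} for r in anon_rows]

for row in link(anonymized, voter_roll):
    print(row["name"], "->", row["diagnosis"])
```

With only two records the join is trivial, but the same mechanism scales to real datasets whenever the quasi-identifier combination is rare enough to be unique.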
Recently, a team from the University of Erlangen-Nuremberg in Germany built a deep learning-based re-identification model to measure the extent to which an X-ray classification system can compromise patient data. The model can tell whether a given pair of X-ray images comes from the same person with an accuracy of 95.55 percent.
“We conclude that publicly available medical chest X-ray data is not entirely anonymous. Using a deep learning-based re-identification network enables an attacker to compare a given radiograph with public datasets and to associate accessible metadata with the image of interest. Thus, sensitive patient data is exposed to a high risk of falling into the unauthorised hands of an attacker who may disseminate the gained information against the will of the concerned patient,” the authors observed.
In 2019, a team from Imperial College London showed that anonymized data could often be reverse-engineered using machine learning. The research demonstrated, for the first time, that de-anonymization is possible even with incomplete datasets: as many as 99.98 percent of individuals in the sample were correctly re-identified using just 15 characteristics such as age, gender, and marital status. Lead author Luc Rocher said at the time, “While there might be a lot of people who are in their thirties, male, and living in New York City, far fewer of them were also born on 5 January, are driving a red sports car, and live with two kids (both girls) and one dog.”
In a 2017 study titled ‘Health Data in an Open World’, the researchers successfully re-identified patients from an Australian open health dataset.
Possible Alternatives To Data Anonymization
Differential Privacy: A technique for publicly sharing information about a dataset by describing the patterns of groups within it while withholding information about individuals. It works by adding carefully calibrated random noise to query results or individual records: aggregate statistics remain accurate, but the reported value for any single individual is deliberately perturbed, which prevents de-anonymization. It is already used by companies such as Apple and Uber.
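A minimal sketch of the idea is the Laplace mechanism, the canonical way to make a counting query differentially private. Everything here (the dataset, the epsilon value) is illustrative; production systems would use a vetted library rather than hand-rolled noise:

```python
import math
import random

# Laplace mechanism sketch: add Laplace(0, sensitivity/epsilon) noise to
# a query result. For a counting query the sensitivity is 1. Smaller
# epsilon means more noise and stronger privacy.

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def private_count(values, predicate, epsilon=0.5):
    """Noisy count of values matching predicate."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [34, 45, 29, 61, 52, 38, 47]
print(private_count(ages, lambda a: a > 40))  # close to the true count of 4
```

Because each released answer is noisy, an attacker cannot confidently infer whether any particular individual is in the dataset, yet repeated aggregate queries still average out to useful statistics.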
Federated learning: Google introduced the technique in 2017. Federated learning lets researchers train statistical models across decentralised devices, each holding its own local dataset, so there is no need to upload private data to the cloud or exchange it with other teams. This mitigates many of the data security and privacy risks of traditional, centralised machine learning.
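The core loop, federated averaging, can be sketched in a few lines. The “model” below is a single weight fitted to y = w·x, and the client datasets are made up for illustration; the point is that only model updates, never raw data, reach the server:

```python
# Federated averaging (FedAvg) sketch: each client takes a gradient step
# on its own local data; the server averages the resulting weights.
# Raw data never leaves the client.

def local_update(w, data, lr=0.01):
    # One gradient-descent step on squared error over this client's data.
    grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
    return w - lr * grad

def federated_round(w, clients, lr=0.01):
    # The server sees only the locally updated weights, not the data.
    updates = [local_update(w, data, lr) for data in clients]
    return sum(updates) / len(updates)

# Hypothetical local datasets, all roughly following y = 3x.
clients = [
    [(1.0, 3.1), (2.0, 5.9)],
    [(1.5, 4.6), (2.5, 7.4)],
    [(0.5, 1.6), (3.0, 9.1)],
]

w = 0.0
for _ in range(200):
    w = federated_round(w, clients)
print(round(w, 2))  # converges near 3.0
```

Real deployments add client sampling, secure aggregation, and often differential privacy on top of this loop, but the data-stays-local principle is the same.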
Homomorphic encryption: In this technique, computations are performed directly on encrypted data without decrypting it first. Since homomorphic encryption makes it possible to manipulate encrypted data without revealing the underlying values, it has enormous potential in sectors such as healthcare and financial services, where privacy is paramount.
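A toy way to see the property is unpadded (“textbook”) RSA, which happens to be multiplicatively homomorphic: the product of two ciphertexts decrypts to the product of the plaintexts. This is a demonstration only, with deliberately tiny keys; real systems use dedicated schemes such as Paillier or BFV/CKKS, not textbook RSA:

```python
# Textbook RSA with tiny keys (p=61, q=53) to show the homomorphic
# property: decrypt(E(a) * E(b) mod n) == a * b. Illustration only;
# never use unpadded RSA or keys this small in practice.

n, e, d = 3233, 17, 2753

def encrypt(m: int) -> int:
    return pow(m, e, n)

def decrypt(c: int) -> int:
    return pow(c, d, n)

a, b = 7, 6
ca, cb = encrypt(a), encrypt(b)

# An untrusted server multiplies the ciphertexts without ever decrypting.
c_product = (ca * cb) % n

print(decrypt(c_product))  # 42 == a * b
```

Fully homomorphic schemes extend this so that both addition and multiplication (and hence arbitrary computation) can be carried out on ciphertexts, which is what makes the healthcare and finance use cases plausible.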