Last week, ImageNet, one of the world’s most influential AI datasets, decided to blur the faces of people in its database in an effort to protect the privacy of those pictured. In another part of the world, researchers at the Friedrich-Alexander University Erlangen-Nürnberg in Germany discovered that the X-ray datasets used by AI classification systems were not as anonymous as initially thought. As we move towards a greater dependence on artificial intelligence and machine learning, will it come at the cost of our privacy? Let’s delve deeper into these incidents.
ImageNet’s Privacy Overhaul
ImageNet began in 2009 as a project to compile images and test whether the growth of artificial intelligence was being held back by a lack of sufficient data. At its inception, ImageNet sourced its categories from WordNet, a database of English words grouped by synonyms. ImageNet took the nouns from WordNet and used them as categories to scrape the internet for images. To facilitate this, ImageNet employed Amazon Mechanical Turk workers to collect images of thousands of objects and people without their explicit consent.
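As a rough illustration of that seeding process (not ImageNet’s actual pipeline), WordNet noun synsets can be turned into candidate image-search queries with NLTK’s WordNet interface. The candidate_categories helper below is hypothetical and purely for illustration.

```python
# Illustrative sketch: deriving image-search categories from WordNet noun
# synsets, loosely mirroring how ImageNet's categories were seeded.
# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def candidate_categories(limit=20):
    """Yield noun synset IDs and human-readable lemma names (hypothetical helper)."""
    for i, synset in enumerate(wn.all_synsets(pos='n')):
        if i >= limit:
            break
        # Lemma names (e.g. 'golden_retriever') double as search queries.
        yield synset.name(), [name.replace('_', ' ') for name in synset.lemma_names()]

for synset_id, queries in candidate_categories():
    print(synset_id, '->', queries)
```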
The ImageNet Large Scale Visual Recognition Challenge, launched in 2010, heralded a new age in the field of AI, spurring on its development by giving companies access to a vast repository for deep learning. Since then, the database has expanded to over 1.5 million images spread across 1,000 categories — 17% of the images contain human faces, even though only three categories relate to people.
However, in a recent move, the team behind ImageNet announced its decision to blur the faces of people in its images over fears that facial recognition systems could be misused against the people represented. Moreover, many, if not all, of these faces were collected without the subjects’ consent.
Although the images were publicly available, often from social media profiles, collecting them to train facial recognition algorithms raises serious ethical concerns. While the researchers behind ImageNet have said that blurring the faces should not affect object recognition algorithms or benchmarking, there remains a possibility that models trained on the blurred images will struggle to recognise unblurred faces when they encounter them.
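To make the general idea concrete, here is a minimal, illustrative sketch of face obfuscation using OpenCV’s bundled Haar cascade detector and a Gaussian blur. ImageNet’s own pipeline used its own face detection and blurring tooling, so treat this only as an approximation of the approach.

```python
# Hypothetical sketch: detect faces and replace them with blurred regions.
# Requires: pip install opencv-python
import cv2

def blur_faces(image_path, output_path):
    # OpenCV ships a pretrained frontal-face Haar cascade with the package.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        # Overwrite each detected face region with a heavily blurred copy.
        region = image[y:y + h, x:x + w]
        image[y:y + h, x:x + w] = cv2.GaussianBlur(region, (51, 51), 30)
    cv2.imwrite(output_path, image)
```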
Compromised Medical AI Dataset
The advancement of AI has also become intertwined with the field of medicine. AI-driven disease classification and X-ray analysis systems are no longer uncommon. These systems, however, rely on large datasets for deep learning to ensure accurate diagnoses, so hospitals often submit anonymised patient data to train them.
However, researchers at the Friedrich-Alexander University Erlangen-Nürnberg in Germany have found that these AI datasets may not be as private as initially thought. They devised a deep learning re-identification technique that can match multiple X-ray scans to the same person with close to 96% accuracy. The dataset they examined contains around 112,000 records. Thus, if even one X-ray scan is compromised by malicious attackers, the patient record linked to that scan can be stolen.
According to the researchers, X-rays that share distinctive abnormalities are easier to match, but even when there are virtually no common identifying markers, the tool can eventually find the match. In such cases, if attackers possess even a partial image of a patient, the re-identification technique can be extended to access their records across multiple datasets.
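The paper’s exact architecture is not reproduced here, but a re-identification model of this kind is commonly built as a siamese network that embeds two scans and scores whether they come from the same patient. The PyTorch sketch below is a hypothetical stand-in under that assumption, using an off-the-shelf ResNet-18 backbone rather than the German team’s actual model.

```python
# Minimal sketch of a siamese-style X-ray re-identification model
# (illustrative only; not the published architecture).
import torch
import torch.nn as nn
from torchvision import models

class XrayReidentifier(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()            # keep the 512-d embedding
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(512 * 2, 128), nn.ReLU(),
            nn.Linear(128, 1))                 # logit: same patient or not

    def forward(self, xray_a, xray_b):
        emb_a = self.backbone(xray_a)
        emb_b = self.backbone(xray_b)
        return self.head(torch.cat([emb_a, emb_b], dim=1))

# Usage: pairs of X-ray tensors, trained with BCEWithLogitsLoss on
# same-patient / different-patient labels.
model = XrayReidentifier()
score = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```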
In 2017, over 27% of identity thefts were connected to medical data breaches, and 15 million patient records were breached in 2018. In just the first half of 2019, that number had already doubled.
The Way Forward
While dataset pioneers like ImageNet have become increasingly aware of the privacy risks they pose, tech giants like Facebook continue to brazenly ignore them. Researchers at Facebook AI announced that their SEER model can outperform current models on object recognition benchmarks. However, there is a catch.
This was achieved by training the model on over a billion images of users scraped from Instagram. The only images excluded were those from the EU, where the GDPR protects user privacy. A few lawsuits have already been filed in the US against IBM (over its Diversity in Faces dataset), FaceFirst (a SaaS-based facial surveillance company), and Google.
Therein lies one solution: stricter regulation. Social media advertising algorithms are already facing regulatory heat over privacy threats, and the same scrutiny should apply to AI dataset privacy.