How Researchers Are Building Models To Safeguard Private Data In Machine Learning

Published on July 18, 2018

by Abhishek Sharma

Machine learning (ML) applications are permeating the tech ecosystem, and the data that goes into ML systems is drawn from all sorts of sources, regardless of its sensitivity. ML algorithms have no notion of sensitivity: they treat data as raw material for finding and learning patterns, without regard to whom the data describes.

Miscreants can take advantage of this and compromise the ML systems themselves, which can have devastating effects. If that happens, the purpose of ML is defeated. To counter this and establish a secure ML environment, researchers are working to build privacy into ML models. In this article, we explore various studies that have made privacy the core focus of ML.

Using Anonymity For Privacy

One option for keeping private data in ML from being misused is anonymisation. But it has a major drawback: anonymised data can sometimes mislead ML models and lead to vague interpretations of the data. Nicolas Papernot and Ian Goodfellow, lead researchers at Google, described the drawback of anonymisation as follows:

“However, anonymising data is not always sufficient and the privacy it provides quickly degrades as adversaries obtain auxiliary information about the individuals represented in the dataset. Famously, this strategy allowed researchers to de-anonymise part of a movie rating dataset released to participants of the Netflix Prize when the individuals had also shared their movie ratings publicly on the Internet Movie Database (IMDb).”

The example cited by the researchers shows that the results can be disastrous even when data has been anonymised. This paved the way for a search for a more robust method of protecting private data.
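The kind of de-anonymisation described in the quote is often called a linkage attack. The following is a minimal Python sketch of the idea, not the method used in the Netflix Prize study: all records, names and the `min_matches` threshold are invented for illustration. It matches pseudonymous rating records against public profiles by counting identical ratings.

```python
# Toy linkage attack. Every record below is invented for illustration.
anonymised_ratings = {
    "user_17": {"Film A": 5, "Film B": 2, "Film C": 4},
    "user_42": {"Film A": 1, "Film D": 5, "Film E": 3},
}
public_profiles = {
    "alice": {"Film A": 5, "Film C": 4},
    "bob": {"Film D": 5, "Film E": 3},
}


def overlap_score(private, public):
    """Count films that carry the same rating in both records."""
    return sum(1 for film, rating in public.items() if private.get(film) == rating)


def link_profiles(anonymised, public, min_matches=2):
    """Map each public name to its best-matching pseudonym.

    A link is reported only if at least `min_matches` ratings agree
    (a hypothetical threshold chosen for this toy example).
    """
    links = {}
    for name, public_ratings in public.items():
        best_id = max(
            anonymised,
            key=lambda uid: overlap_score(anonymised[uid], public_ratings),
        )
        if overlap_score(anonymised[best_id], public_ratings) >= min_matches:
            links[name] = best_id
    return links


links = link_profiles(anonymised_ratings, public_profiles)
print(links)
```

Even though the first dataset contains no names, two matching ratings are enough here to tie each pseudonym to a public identity. Real attacks work with sparser and noisier data, but the principle is the same.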

Differential Privacy: The Base For Measuring Privacy Risk

In his paper, reputed ML author Tom Mitchell emphasises that the benefits of ML are sometimes called into question over privacy issues. He suggests that getting the best of both ML and privacy requires some trade-offs between them. In addition, he highlights that ML can be modified to keep sensitive information from falling into the wrong hands, thus preventing disasters in the system.

This led many researchers to pursue privacy in other fields of computer science as well, such as cryptography and database systems. One notion of privacy that gained popularity in the research community is “differential privacy”, which relies on mathematical guarantees and statistically controlled access to a database, so that data analysis algorithms can extract information without exposing individual records.

From then on, various research studies built on the original work on differential privacy to carry it over to different applications. Their results showed that the concept was significant and powerful enough for those applications. Gradually, differential privacy made its way into ML as well.

From Omnipresent To Private: Progress In Differential Data Privacy

One of the earliest studies to bring differential privacy to ML was by scholars Anand Sarwate and Kamalika Chaudhuri from the University of California, who explored various models and algorithms designed to maintain privacy and analysed them with respect to ML and signal processing. In the study, differential privacy is explained in terms of mathematical functions coupled with a database.

To achieve this, the authors add a noise function to obtain a differentially private approximation of the algorithm, so that its output is randomised. This is applied to common ML tasks such as classification, regression, dimensionality reduction and time series analysis, yielding a general privacy-preserving mechanism. Although this study had practical limitations stemming from its technical assumptions, it served as the basis for further studies.
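To make the idea of adding noise concrete, here is a minimal Python sketch of the Laplace mechanism, a standard textbook way of making a counting query differentially private. It is not the specific algorithm from the study above. The function names, the toy `ages` data and the privacy parameter `epsilon` are assumptions chosen for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon is the
    usual choice for epsilon-differential privacy.
    """
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy data: ages of six individuals.
ages = [23, 35, 41, 52, 67, 29]
noisy = private_count(ages, lambda age: age > 40, epsilon=1.0)
print(f"Noisy count of people over 40: {noisy:.2f}")
```

A smaller `epsilon` means a larger noise scale and therefore stronger privacy, at the cost of a less accurate answer. This is the privacy versus utility trade-off that runs through the studies discussed here.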

Subsequent papers on privacy in ML focussed on optimising loss functions, dealing with missing data and handling publicly available data.

Privacy-Preserving ML Models

Recent years have seen new approaches to privacy in ML. Popular among them are privacy-preserving ML models, which use insights from differential privacy to counter malicious attacks at every phase and component of ML.

One paper by academics at Pennsylvania State University and the University of Michigan presents a demonstrative ML model that showcases and analyses adversarial attacks at every stage of the model, from training to inference. They lay out threat models to pinpoint possible attacks. The authors emphasise that there is always some tension between privacy and ML prediction, and that this trade-off has to be balanced carefully.

Another study, SecureML, by researchers Payman Mohassel and Yupeng Zhang, provides novel ways to tackle adversaries in various ML scenarios. They specifically consider stochastic gradient descent (SGD) to devise privacy-preserving ML algorithms with less loss and fewer implications for the data.
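The SecureML paper builds its protocols on secure two-party computation, in which values are split into secret shares held by two servers that are assumed not to collude. The Python sketch below illustrates only that basic building block, additive secret sharing. It is not the SecureML protocol itself, and the modulus and function names are assumptions made for this example.

```python
import secrets

# Hypothetical modulus for this sketch; all arithmetic is done modulo it.
MODULUS = 2**61 - 1


def share(value, n_parties=2):
    """Split `value` into additive shares that sum to it modulo MODULUS.

    Each share on its own is a uniformly random number and reveals
    nothing about the value.
    """
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    """Recombine all shares to recover the original value."""
    return sum(shares) % MODULUS


def add_shared(a_shares, b_shares):
    """Add two shared values; each party only touches its own shares."""
    return [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]


x_shares = share(20)
y_shares = share(22)
total = reconstruct(add_shared(x_shares, y_shares))
print(total)
```

Addition can be carried out locally on the shares, which is what makes sums such as gradient aggregates cheap to compute privately. Multiplication needs extra interaction between the parties and is where much of the protocol design effort in systems of this kind goes.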

The importance of privacy has now even spread to building systems that offer ML privacy as a service. A system called Chiron has been developed to cater to entities that have ML-as-a-service (MLaaS) platforms. The developers of Chiron say that the system runs popular ML frameworks such as Theano on SGX enclaves. They show that Chiron is a practical option for ML privacy.

Conclusion

All of the above studies and methods show what can be done to build privacy into mission-critical ML systems. Since this is a new area of research, it will take time before ML systems can be considered fully privacy-adherent.