Companies today are using more and more user data to build models that improve their products and user experience. They also want to measure user sentiment so they can develop products that meet users' needs. However, this predictive use of data can harm individuals who wish to protect their privacy.
Building models on sensitive personal data can undermine users' privacy and can cause real damage to a person if the data is leaked or misused. A simple solution that companies have employed for years is data anonymisation: removing personally identifiable information from datasets. But researchers have found that personal information can be extracted from anonymised datasets by cross-referencing them with other data sources, something known as a linkage attack.
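To make the idea of a linkage attack concrete, here is a minimal sketch. All records, field names and the `link` helper are made up for illustration; it only assumes that the "anonymised" release still carries quasi-identifiers (zip code, birth year, sex) that also appear in some other dataset containing names.

```python
# Toy, made-up records: names were removed from the release, but
# quasi-identifiers (zip code, birth year, sex) were left in place.
anonymised_health = [
    {"zip": "560001", "birth_year": 1985, "sex": "F", "diagnosis": "asthma"},
    {"zip": "560002", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
]

# A second, hypothetical dataset that is public and includes names.
public_register = [
    {"name": "Person A", "zip": "560001", "birth_year": 1985, "sex": "F"},
    {"name": "Person B", "zip": "560002", "birth_year": 1990, "sex": "M"},
    {"name": "Person C", "zip": "560002", "birth_year": 1971, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(anonymised, auxiliary, keys=QUASI_IDENTIFIERS):
    """Re-identify rows whose quasi-identifiers match exactly one named record."""
    index = {}
    for person in auxiliary:
        key = tuple(person[k] for k in keys)
        index.setdefault(key, []).append(person["name"])

    matches = []
    for row in anonymised:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the row
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = link(anonymised_health, public_register)
print(reidentified)  # [('Person A', 'asthma'), ('Person B', 'diabetes')]
```

No names were ever published alongside the diagnoses, yet both rows are re-identified because each combination of quasi-identifiers is unique in the auxiliary data.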
Because anonymisation alone is not enough, companies are increasingly turning to other techniques to preserve the privacy and security of data. In this article, we will take a look at them.
Differential Privacy
Differential privacy is a technique for sharing knowledge or analytics about a dataset by describing the patterns of groups within the dataset while withholding sensitive information about the individuals in it. The concept behind differential privacy is that if the effect of making an arbitrary single change in the database is small enough, the query result cannot be used to infer much about any single person, and hence provides privacy. Another way to explain differential privacy is that it is a constraint on the algorithms used to publish aggregate information about a statistical database, which limits the exposure of individual information in database entries.
Fundamentally, differential privacy works by adding enough random noise to data or query results that there are mathematical guarantees of individuals’ protection from reidentification. As a result, the output of a data analysis is nearly the same whether or not a particular individual is included in the data.
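One common way to add such noise is the Laplace mechanism. The sketch below is illustrative only and is not taken from any of the libraries mentioned in this article; the function names, the `epsilon` value and the sample data are assumptions. It answers a counting query, where adding or removing one person changes the true answer by at most 1.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=random):
    """Return a noisy count of the records that satisfy `predicate`.

    A counting query has sensitivity 1 (one person changes the true
    count by at most 1), so the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


ages = [23, 35, 41, 29, 52, 38, 61, 47]  # made-up data
rng = random.Random(0)  # seeded only to make the demo repeatable
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(noisy)  # the true count is 4; the printed value is 4 plus random noise
```

A smaller `epsilon` means more noise and stronger privacy; a larger one gives more accurate answers with a weaker guarantee.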
Facebook has utilised the technique to protect sensitive data it made available to researchers analysing the effect of sharing misinformation on elections. Uber employs differential privacy to detect statistical trends in its user base without exposing personal information. Google also open-sourced its differential privacy library, an internal tool used by the organisation to safely extract insights from datasets which contain sensitive personal information of its users.
Secure Multi-Party Computation
Based on cryptographic algorithms, Secure Multi-Party Computation (SMPC) allows multiple parties to combine their private inputs to compute a function without revealing those inputs to each other. The parties can agree on any function they want to compute on their private inputs, exchange information, and learn just the output of that function. Given the extraordinary advancements being made in the fields of artificial intelligence and machine learning, such a tool could be invaluable today.
For example, suppose a tech company provides a health diagnostic tool hosted on its cloud platform. A patient with sensitive medical information can interface with the web tool and, using SMPC, run diagnostics on their private data and learn whether they are at risk for some disease. All this can be done without the patient ever revealing anything about their confidential medical information to the tech company. In fact, SMPC can be used in almost any scenario where information must be exchanged and computation performed between parties that do not trust one another. Zero-knowledge proofs are one popular cryptographic technique used in multi-party computation.
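A simple building block behind many SMPC protocols is additive secret sharing. The sketch below is a toy illustration, not a production protocol: the modulus, the function names and the inputs are assumptions, all parties are simulated in one process, and it assumes participants follow the protocol honestly. It computes the sum of several private numbers without any party seeing another party's input.

```python
import secrets

MODULUS = 2**61 - 1  # public modulus; must be larger than the true sum


def share(secret, n_parties):
    """Split `secret` into n random shares that add up to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_inputs):
    n = len(private_inputs)
    # Step 1: party i splits its input and sends share j to party j.
    distributed = [share(x, n) for x in private_inputs]
    # Step 2: party j adds the shares it received and publishes only that sum.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # Step 3: the published partial sums reveal the total, not the inputs.
    return sum(partial_sums) % MODULUS


salaries = [52000, 61000, 47000]  # made-up private inputs
total = secure_sum(salaries)
print(total)  # 160000
```

Each individual share looks like a uniformly random number, so a party learns nothing about another party's input from the shares it receives; only the final total is revealed.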
Federated Learning
Standard machine learning approaches require training data to be centralised on one machine or in a datacenter, and companies like Google have built secure and robust cloud infrastructures for processing this data to make their services better. For models trained from user interaction with mobile devices, Google introduced a different technique called Federated Learning.
TensorFlow Federated (TFF) by Google was also created to promote open research and experimentation with Federated Learning. It has been used, for instance, to train prediction models for mobile keyboards without the need to upload sensitive typing data to cloud servers.
Federated Learning allows mobile phones to collaboratively learn a shared ML model while keeping all the training data on the device, separating the ability to do data processing from the typical necessity of storing the data in the cloud.
According to Google, Federated Learning works as follows: the device downloads the current model and improves it by learning from data on the user's phone. It then summarises the changes as a small, focused update. Only this update to the model is sent to the cloud, over an encrypted connection, where it is immediately averaged with other users' updates to improve the shared model. All the training data stays on the device, and no personal data is stored in the cloud.
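The download, local update and averaging loop described above can be sketched in a few lines. This is a simplified illustration and does not use TensorFlow Federated: the linear model, learning rate, client data and function names are all assumptions, and the encrypted transport and any weighting by client data size are left out.

```python
def local_update(weights, local_data, lr):
    """One gradient step on-device for the model y = w0 + w1 * x.

    Returns only the change in weights; the data never leaves the device.
    """
    w0, w1 = weights
    g0 = g1 = 0.0
    for x, y in local_data:
        err = (w0 + w1 * x) - y
        g0 += 2 * err
        g1 += 2 * err * x
    n = len(local_data)
    return [-lr * g0 / n, -lr * g1 / n]


def federated_round(weights, clients, lr=0.02):
    """Server side: average the clients' updates and apply them."""
    updates = [local_update(weights, data, lr) for data in clients]
    averaged = [
        sum(u[k] for u in updates) / len(updates) for k in range(len(weights))
    ]
    return [w + d for w, d in zip(weights, averaged)]


# Made-up on-device datasets, all generated from y = 1 + 2x.
clients = [
    [(1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
    [(0.0, 1.0), (5.0, 11.0)],
]

weights = [0.0, 0.0]
for _ in range(2000):
    weights = federated_round(weights, clients)
print(weights)  # converges towards w0 = 1, w1 = 2
```

The server only ever sees the small weight updates, which is the property the paragraph above describes; the raw `(x, y)` pairs stay inside `local_update`.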
Homomorphic Encryption
Typically, companies run ML models on data in an unencrypted format. Homomorphic encryption makes it possible to outsource the storage and computation of data to cloud environments in encrypted form. It differs from typical encryption and multi-party computation methods in that it allows data processing to be done directly on encrypted data without needing access to a secret key.
Homomorphic encryption enables users to process ciphertexts to deliver desired results without decrypting the sensitive data. This can then be used to gather analytics, for example, on user data, without revealing the contents to the computation engine that is going to calculate the analytics. The output of such a process remains in an encrypted form and can be unveiled by the owner of the encryption key.
What is remarkable about homomorphic encryption is that people can achieve the same processing results (in encrypted form) by completing the computations on the encrypted data as they would have by performing it on unencrypted data. Research teams have shown that they can run machine learning algorithms on encrypted data using homomorphic encryption to preserve privacy.
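The property can be seen in a toy version of the Paillier cryptosystem, which is only additively homomorphic (it supports adding encrypted numbers, not arbitrary computation, so it is not the fully homomorphic scheme a real ML workload would need). The sketch below is an assumption-laden illustration: the tiny primes make it completely insecure, the function names are invented for this example, and it needs Python 3.8+ for the modular inverse via `pow`.

```python
import math
import secrets


def keygen(p=1789, q=1861):
    """Toy key generation from two small primes (insecure, for illustration)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    mu = pow(lam, -1, n)  # valid shortcut because g = n + 1
    return (n, g), (lam, mu)


def encrypt(pub, m):
    n, g = pub
    n2 = n * n
    while True:  # pick a random r that is invertible mod n
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    l_value = (pow(c, lam, n * n) - 1) // n
    return (l_value * mu) % n


def add_encrypted(pub, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    n, _ = pub
    return (c1 * c2) % (n * n)


pub, priv = keygen()
c_sum = add_encrypted(pub, encrypt(pub, 120), encrypt(pub, 45))
print(decrypt(pub, priv, c_sum))  # 165
```

Whoever runs `add_encrypted` needs only the public key and never sees 120, 45 or 165; only the holder of the private key can decrypt the result.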
Julia Computing, for instance, developed a process that applies homomorphic encryption to run machine learning models on encrypted data. Here, users can send their encrypted data to the cloud through an API and get back the encrypted result from the machine learning models.
During the entire process, the data is neither decrypted nor stored in the cloud. Consequently, the cloud provider cannot access the users’ data. Homomorphic encryption allows safe outsourcing of storage and computation on sensitive data to the cloud, but there are trade-offs between performance, protection and utility.
Blockchain
For most AI models, data is processed and inspected manually by humans to ensure it is of high enough quality for sophisticated AI learning. But human errors are inevitable. Human errors, incomplete data and deviations from the original data may lead to unexpected outputs from AI learning. In this context, researchers have examined cases where AI training data was inaccurate or insecure and have called for training data to be managed before machine learning is done.
Blockchain, or distributed ledger technology, can establish the integrity of training data. A data-preserving AI environment built on it is expected to prevent the cyberattacks and data deterioration that may occur when raw data is collected and processed over an open network. Applied this way, blockchain can ensure data integrity and so improve the reliability of AI.
Blockchain can store a cryptographic hash of the raw data in separate time-stamped block headers. At the time of processing, the integrity of the data can be verified by checking it against the stored hashes, which exposes any change made since the earlier blocks were written. Through verifiable tracking of raw and processed datasets, blockchain can help maintain the quality of the AI model.
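The integrity check itself can be illustrated with a plain hash chain. The sketch below is a single-process simplification with no network, consensus or distribution, so it is not a blockchain in the full sense; the block layout, field names and sample data are assumptions made for this example.

```python
import hashlib
import json
import time

GENESIS = "0" * 64  # placeholder "previous hash" for the first block


def hash_header(header):
    """Hash the header fields that every block commits to."""
    payload = {k: header[k] for k in ("timestamp", "data_hash", "prev_hash")}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


def make_block(data, prev_hash):
    """Record a time-stamped hash of `data`, chained to the previous block."""
    header = {
        "timestamp": time.time(),
        "data_hash": hashlib.sha256(data).hexdigest(),
        "prev_hash": prev_hash,
    }
    header["block_hash"] = hash_header(header)
    return header


def verify(chain, datasets):
    """Check that each dataset and each link in the chain is unmodified."""
    if len(chain) != len(datasets):
        return False
    prev = GENESIS
    for block, data in zip(chain, datasets):
        if block["prev_hash"] != prev:
            return False
        if block["data_hash"] != hashlib.sha256(data).hexdigest():
            return False
        if block["block_hash"] != hash_header(block):
            return False
        prev = block["block_hash"]
    return True


datasets = [b"raw training batch 1", b"raw training batch 2"]  # made-up data
chain, prev = [], GENESIS
for data in datasets:
    block = make_block(data, prev)
    chain.append(block)
    prev = block["block_hash"]

print(verify(chain, datasets))                                     # True
print(verify(chain, [b"raw training batch 1", b"poisoned batch"]))  # False
```

Because each block commits to the hash of the previous one, changing an old dataset or an old header breaks every later link, which is what makes tampering detectable before the data is used for training.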
Furthermore, it offers protection against malicious attacks on servers, such as DDoS (Distributed Denial of Service), and helps prevent manipulation of data by insiders. In addition, the cryptographic techniques built into blockchain help guard against data leakage.
Vishal Chawla is a senior tech journalist at Analytics India Magazine and writes about AI, data analytics, cybersecurity, cloud computing, and blockchain. Vishal also hosts AIM's video podcast called Simulated Reality- featuring tech leaders, AI experts, and innovative startups of India.