Google has announced a new differentially private clustering algorithm that generates new representative data points privately. Privacy is a huge concern these days, with data leaks and breaches happening quite frequently. With so much individual data available to industries today, it has become all the more important to use technology to ensure that sensitive personal information is protected at all costs.
This is where Google has stepped in. Google's new algorithm tackles the question: given a database containing sensitive user attributes, how can group characteristics be revealed without exposing the privacy of individual users in the dataset? After evaluating the algorithm's performance on multiple datasets and comparing it to existing baselines, the researchers found its results to be competitive with, or better than, those baselines.
Recently, Facebook has also been working in this area of privacy maintenance. It introduced Opacus, a free, open-source library for training deep learning models with differential privacy.
The issue with clustering methods
In machine learning, group characteristics of a dataset are typically extracted with clustering methods. k-means clustering is one of the most important of these, but its cluster centres can reveal information about individual data points and compromise user privacy.
What has Google really done?
This algorithm works in the central model of differential privacy, in which the central server has complete access to the raw data. Algorithms in the central model can be substituted for their non-private counterparts, as they require no change to the data collection process.
The algorithm first generates, in a differentially private manner, a core-set consisting of weighted points that represent the original data points. Any (non-private) clustering algorithm can then be run on this privately generated core-set.
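To make the second stage concrete, here is a minimal sketch of running an ordinary (non-private) clustering step on a weighted core-set. The function below is a plain numpy implementation of Lloyd's k-means with point weights; it is an illustrative stand-in, not Google's code, and any off-the-shelf weighted clustering routine could be used in its place:

```python
import numpy as np

def weighted_kmeans(points, weights, k, iters=50, seed=0):
    """Lloyd's k-means over weighted core-set points (illustrative sketch).

    `points` are the core-set points, `weights` their weights; no privacy
    machinery is needed here because the core-set is already private.
    """
    rng = np.random.default_rng(seed)
    # initialise centres from k distinct core-set points
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign each core-set point to its nearest centre
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centre as the weighted mean of its assigned points
        for j in range(k):
            mask = labels == j
            if weights[mask].sum() > 0:
                centers[j] = np.average(points[mask], axis=0,
                                        weights=weights[mask])
    return centers, labels
```

Because the privacy guarantee is already baked into the core-set, this stage needs no noise at all; that separation is what lets the approach reuse existing clustering code.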
Google says this algorithm generates the private core-set by first applying random-projection-based locality-sensitive hashing (LSH) recursively to partition points into 'buckets' of similar points. Each bucket is then replaced by a single weighted point: the average of the points in the bucket.
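The recursive bucketing step can be sketched as follows. This is a simplified illustration, not the paper's implementation: the stopping thresholds are arbitrary choices, the splitting hyperplane is centred at the bucket mean for convenience, and the privacy noise (discussed below) is omitted:

```python
import numpy as np

def lsh_coreset(points, max_bucket=50, depth=0, max_depth=10, rng=None):
    """Recursively partition points with random-projection LSH, then
    collapse each bucket to an (average point, count) pair.

    Sketch only: thresholds are illustrative and no noise is added,
    so this version is NOT differentially private on its own.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # stop splitting: emit one weighted point for this bucket
    if len(points) <= max_bucket or depth >= max_depth:
        return [(points.mean(axis=0), len(points))]
    # a random hyperplane through the bucket mean splits it in two
    direction = rng.standard_normal(points.shape[1])
    side = (points - points.mean(axis=0)) @ direction > 0
    coreset = []
    for half in (points[side], points[~side]):
        if len(half):
            coreset += lsh_coreset(half, max_bucket, depth + 1, max_depth, rng)
    return coreset
```

Note that the weighted mean of the core-set points equals the mean of the original data exactly, which is why the core-set can stand in for the raw points in downstream clustering.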
Previous Privacy-Focused Initiatives
New Insights into Human Mobility with Privacy Preserving Aggregation – Google designed an algorithm to analyse population mobility with privacy-preserving techniques. It created representative models of aggregate data by using differential privacy along with k-anonymity.
Federated Analytics: Collaborative Data Science without Data Collection – Google introduced federated analytics, which applies data science methods to raw data stored locally on users' devices. It runs local computations on each device's data and returns only aggregated results, without revealing individual data from any particular device.
Learning Statistics with Privacy, aided by the Flip of a Coin – RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response) studies the behaviour of client software while keeping each user's privacy intact.
How does this algorithm satisfy differential privacy?
Due to the added noise, each data point's contribution to the bucket counts and averages is concealed. This way, the algorithm's behaviour does not expose information about any individual data point.
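One standard recipe for this noising step (a sketch, not necessarily Google's exact mechanism) is to add Laplace noise to the bucket count and to each coordinate of the bucket sum, then divide. The `radius` parameter is an assumed bound on each point's coordinates, which limits how much one point can change the sum:

```python
import numpy as np

def private_bucket_point(bucket, epsilon, radius, rng):
    """Release a noisy (average, weight) pair for one bucket.

    Illustrative Laplace-mechanism sketch: `radius` is an assumed
    per-coordinate bound on the data, and the budget split between
    count and sum is an arbitrary 50/50 choice.
    """
    d = bucket.shape[1]
    # noisy count: one point changes the count by at most 1
    noisy_count = len(bucket) + rng.laplace(scale=2.0 / epsilon)
    # noisy sum: one point changes each coordinate by at most `radius`
    noisy_sum = bucket.sum(axis=0) + rng.laplace(scale=2.0 * radius / epsilon,
                                                 size=d)
    # avoid dividing by a tiny or negative noisy count
    noisy_count = max(noisy_count, 1.0)
    return noisy_sum / noisy_count, noisy_count
```

With many points per bucket the noise is small relative to the true count and sum, so the released average stays accurate; with few points the noise dominates, which is exactly the trade-off discussed below.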
Does this always work?
This might not always work. The researchers note that if the buckets contain too many points, individual points will not be well represented by the core-set points; if the buckets contain too few points, the added noise becomes significant compared to the actual values. Either way, the quality of the core-set suffers.
What did Google find out?
The research team evaluated the algorithm on benchmark datasets, comparing it to the non-private k-means++ algorithm and to existing private baselines, diffprivlib and dp-clustering-icml17. The evaluation tracked the normalised k-means loss while varying the number of target centres (k) for these benchmark datasets. The algorithm described above achieved a lower loss than the other private algorithms on three of the four datasets used.
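For reference, the k-means loss being compared is the sum of squared distances from each point to its nearest centre. A minimal sketch is below; dividing by the number of points is one common normalisation convention, and the paper's exact normalisation may differ:

```python
import numpy as np

def kmeans_loss(points, centers, normalize=True):
    """k-means cost: sum of squared distances to the nearest centre,
    optionally divided by the number of points (assumed normalisation)."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    loss = (dists.min(axis=1) ** 2).sum()
    return loss / len(points) if normalize else loss
```

A lower loss means the chosen centres sit closer to the data, so comparing this quantity across algorithms at each k measures how much clustering quality the privacy protection costs.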
Benchmark datasets used:
- Synthetic dataset of 100,000 data points in 100 dimensions sampled from a mixture of 64 Gaussians
- Neural representations for the MNIST dataset on handwritten digits obtained by training a LeNet model
- UC Irvine dataset on Gas Turbine CO and NOx Emissions
- UC Irvine dataset on Letter Recognition
Image Source: Google. (Normalised k-means loss for varying k, the number of target clusters; lower is better. Solid curves denote the mean over 20 runs, and the shaded region denotes the 25th–75th percentile range.)
Though differentially private clustering algorithms are still in their early stages, their potential looks very bright. With privacy being such an important issue now, these algorithms can help industries that deal with huge amounts of data maintain confidentiality while handling sensitive information.