For the first time in its 200-year history, the US Census Bureau has announced that this year’s survey will implement new standards to safeguard citizen data. To do so, the agency is adopting differential privacy.
Differential privacy has been around as a concept since the early 2000s. Lately, demand for it has grown thanks to the increased adoption of data science techniques by organisations, and differential privacy was also named in the 2020 Gartner Hype Cycle.
With data comes responsibility: protecting the privacy of data providers is crucial. Whether the data is a population census or customer feedback on app stores, no company should be able to trace a record back to its source easily.
Differential privacy offers a mathematical framework to anonymise data. It is a high-assurance, analytic means of ensuring that use cases like these are addressed in a privacy-preserving manner.
Differential privacy aims to ensure that a query on the data returns approximately the same result regardless of whether any individual record is included. To achieve this, we need to know the maximum impact a single record can have on the query result. For queries such as sums, this is determined by the highest and lowest possible values in the dataset, and is referred to as the sensitivity of the query. The higher the sensitivity, the more noise needs to be applied.
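To make this concrete, here is a minimal sketch of the Laplace mechanism in Python. `noisy_sum`, `laplace_noise` and the clipping bounds are illustrative names of our own, not any particular library's API; the key point is that the noise scale is the query's sensitivity divided by the privacy parameter epsilon:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """One draw from a Laplace(0, scale) distribution via inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_sum(values, lower, upper, epsilon):
    """A sum query made differentially private: each value is clipped
    into [lower, upper], so one record can change the true sum by at
    most max(|lower|, |upper|) -- the sensitivity of the query."""
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = max(abs(lower), abs(upper))
    return sum(clipped) + laplace_noise(sensitivity / epsilon)
```

With salaries bounded in [0, 100000], for instance, the sensitivity is 100000, so achieving the same epsilon requires proportionally more noise than for values bounded in [0, 1].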
According to Microsoft, to protect personally identifiable or confidential information within datasets, differential privacy utilises two mechanisms:
- Some statistical “noise” is added to each result to mask the contribution of individual data points.
- Information revealed from each query is calculated and deducted from an overall privacy budget to halt additional queries.
Adding noise is a bit like pixelating a picture: individual details are obscured while the overall shape is preserved. This does protect privacy, but there is a trade-off with the accuracy of the results computed on the data.
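The privacy-budget mechanism described above can be sketched as follows; `PrivacyBudget` is a hypothetical class of our own, not part of any particular library:

```python
class PrivacyBudget:
    """Hypothetical tracker implementing the 'overall privacy budget'
    idea: each query spends some epsilon, and further queries are
    refused once the budget is exhausted (sequential composition)."""

    def __init__(self, total_epsilon: float):
        self.remaining = total_epsilon

    def spend(self, epsilon: float) -> bool:
        """Deduct epsilon if available; return False to refuse the query."""
        if epsilon > self.remaining:
            return False
        self.remaining -= epsilon
        return True

budget = PrivacyBudget(total_epsilon=1.0)
print(budget.spend(0.4))  # True: part of the budget remains
print(budget.spend(0.7))  # False: the query is refused
```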
Top 10 Players developing Differential Privacy tools. (Source: linknovate)
Companies like Google have even rolled out open-source differential privacy libraries. Let’s take a look at a few of these tools:
Microsoft’s OpenDP
OpenDP is a suite of open-source tools developed by Microsoft and Harvard to enable privacy-protective analysis of sensitive personal data. The project focuses on algorithms for generating differentially private statistical releases. With OpenDP, the team mainly targets applications in government and other institutions, where the sensitivity of the data being shared must be safeguarded to enable seamless scientific research.
IBM’s Diffprivlib
Developed by IBM, Diffprivlib is a general-purpose library with which developers can experiment, investigate and build differential privacy applications. There are a few key features of this library which, IBM claims, are absent in other popular ones:
- For accounting, it offers tools to track and limit the privacy spend across multiple operations;
- Offers a comprehensive collection of the basic building blocks of differential privacy, used to build new tools and applications;
- For machine learning algorithms, it offers pre-processing, classification, regression and clustering.
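As a rough, library-agnostic illustration of what such a differentially private aggregate computes, here is a minimal pure-Python sketch of a noisy mean; `dp_mean` and `laplace_noise` are our own illustrative names, not Diffprivlib's actual API:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """One draw from a Laplace(0, scale) distribution via inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean over a dataset of known size n:
    clipping each value into [lower, upper] bounds the impact of any
    one record on the mean at (upper - lower) / n."""
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon)
```

Note that clipping to known bounds is what makes the sensitivity finite in the first place; real libraries either ask the user for these bounds or spend extra privacy budget estimating them.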
Google’s Differential Privacy library
Google released its open-source library last year to meet the needs of developers. Here are some of the key features of this library:
- It supports most common data science operations: it can compute counts, sums, averages, medians and percentiles, the aggregations most widely used with differential privacy.
- Has an extensible ‘Stochastic Differential Privacy Model Checker library’ to help prevent mistakes.
- It comes with a PostgreSQL extension and a quick start guide.
- Developers can extend it with other functionalities such as additional mechanisms, aggregation functions, or privacy budget management.
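To illustrate a differentially private count of the kind such libraries compute, here is a hedged sketch, not Google's actual implementation: for integer-valued queries one common choice is the two-sided geometric mechanism, the discrete analogue of Laplace noise.

```python
import math
import random

def two_sided_geometric(epsilon: float) -> int:
    """Integer noise with P(z) proportional to exp(-epsilon * |z|):
    the discrete analogue of the Laplace distribution."""
    alpha = math.exp(-epsilon)

    def geometric() -> int:
        # Failures before the first success, success probability 1 - alpha.
        return int(math.log(1.0 - random.random()) // math.log(alpha))

    return geometric() - geometric()

def dp_count(records, predicate, epsilon: float) -> int:
    """Differentially private count: adding or removing one record
    changes the true count by at most 1, so the sensitivity is 1."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + two_sided_geometric(epsilon)
```

Because the output stays an integer, a noisy count of this kind is plausible as a count, unlike a real-valued Laplace draw.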
Privacy is a cornerstone of data sharing, and differential privacy provides a principled framework for navigating it. Apple, too, employs differential privacy techniques to collect feedback from its users in a safe way.