Differential privacy featured in Gartner's 2021 Hype Cycle. Once a niche research area, it has gained much relevance in recent years due to heightened awareness of privacy and security issues in modern systems. Privacy, once considered a luxury, has become a mandate. Combine that with strict policy frameworks like CCPA and GDPR, and the overall context of privacy becomes more serious than ever.
Speaking about the changing world of privacy, with a focus on differential privacy, was Manoj Kumar Rajendran, a Principal Data Scientist at MiQ Digital India. He presented a tech talk titled 'Understanding and Leveraging Differential Privacy' on Day 1 of the Deep Learning Developer Conference 2021.
“At MiQ, most work we do revolves around the usage of third party cookies. As we speak, the world is moving away from these third party cookies. Google recently announced that next year onwards, they will not be supporting 3rd party cookies anymore. We already have a cookieless scheme at MiQ which aims at preparing us for a world where privacy is a major deciding factor. At MiQ, we use 1000s of data vendors, and we collect huge amounts of data. Some of this information may not be available in a differentially private world. The onus is on every one of us to make the data and process private,” said Rajendran of the role privacy plays at his company.
Conventional methods not enough
To demonstrate that conventional methods of ensuring privacy, such as hashing, are simply not enough, Rajendran cited two prominent incidents.
In 2006, streaming platform Netflix hosted a competition asking participants to predict the ratings a user would give, based on that user's movie-rating history and the kinds of movies they watched. A planned follow-up competition had to be scrapped for privacy reasons: researchers at the University of Texas were able to re-identify users in the anonymised dataset. The Netflix team's assumption that merely hashing user identifiers would thwart privacy attacks proved terribly wrong, as it failed to account for one of the most important attack vectors: the linkage attack. By linking the released records with publicly available ratings on the IMDb database, the researchers were able to identify individual users.
Another well-known incident happened in New York City. In response to a Freedom of Information request, the city administration released data on every taxi trip taken in the city over the previous year. To preserve privacy, it hashed identifying fields such as the cab's medallion number, the driver's licence number and the number plate. As in the Netflix case, it took hackers only a few hours to recover the original values, because the space of possible identifiers was small enough to enumerate by brute force. In some cases, the recovered trips were then linked with easily available information about public figures, creating an even bigger privacy breach.
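The taxi-data failure is easy to reproduce: when an identifier comes from a small, predictable space, hashing it offers no protection, because an attacker can simply hash every candidate and compare. A minimal sketch (the licence number and six-digit format here are hypothetical, chosen only to illustrate the small search space):

```python
import hashlib


def md5_hex(s: str) -> str:
    """MD5 digest of a string, of the kind reportedly used in the NYC release."""
    return hashlib.md5(s.encode()).hexdigest()


# A hypothetical six-digit hack-licence number. The real space of NYC
# licence/medallion identifiers is small enough to enumerate exhaustively.
secret_licence = "482915"
published_hash = md5_hex(secret_licence)

# "De-anonymising" the published hash by brute force: hash every
# candidate identifier until one matches.
recovered = next(
    candidate
    for candidate in (f"{n:06d}" for n in range(1_000_000))
    if md5_hex(candidate) == published_hash
)
print(recovered)  # recovers the original licence number in seconds
```

The lesson is that hashing only helps when the input is unpredictable; identifiers drawn from a small, known format are effectively published in the clear.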
“These two incidents show that hashing and data anonymisation in a conventional way will not guarantee privacy,” said Rajendran.
Randomised Response & Differential Privacy
Surveys have used the concept of randomised response for quite some time now. Randomised response helps in understanding the overall behaviour of the population without exposing individual choices.
To demonstrate this concept more clearly, Rajendran used the example of a survey to estimate the number of smokers. He followed a simple randomised process in which each respondent's answer is perturbed by chance, yet solving a simple equation over the aggregate recovers the proportion of people who smoke. “I do not identify which person actually smokes. This means I learnt about the population, but I did not learn about a specific individual. It is an age-old technique and forms the backbone for differential privacy,” he added. Differential privacy generalises this idea: the output of a query is masked with random noise, so the result barely changes whether or not any single individual's data is included.
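The smoker-survey idea can be sketched with the classic two-coin variant of randomised response (this is an illustrative sketch, not the exact procedure from the talk): each respondent answers truthfully on a heads, and otherwise answers with a second fair coin flip. The aggregate can then be unbiased with a simple equation.

```python
import random


def randomized_response(truth: bool) -> bool:
    """Two-coin randomised response.

    First fair coin heads: answer truthfully.
    First fair coin tails: answer with a second fair coin flip.
    """
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5


def estimate_true_rate(responses) -> float:
    """Unbias the aggregate: E[yes] = 0.5 * p + 0.25, so p = 2 * E[yes] - 0.5."""
    yes_rate = sum(responses) / len(responses)
    return 2 * yes_rate - 0.5


random.seed(0)
true_smoker_rate = 0.30
population = [random.random() < true_smoker_rate for _ in range(100_000)]
responses = [randomized_response(x) for x in population]
print(round(estimate_true_rate(responses), 2))  # close to 0.30
```

Any individual "yes" is deniable (it may have come from the coin), yet the population-level rate is recovered accurately.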
Credit: Manoj Kumar Rajendran
Rajendran also spoke about the interesting concept of a 'budget' in differential privacy. The budget puts a limit on the amount of information that can be extracted; once a certain threshold is crossed, the system blocks access to any additional information.
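One way to picture the budget is as a simple accountant that every query must pass through. The sketch below assumes a basic epsilon-based budget where each query's costs simply add up (the class name and API are hypothetical, for illustration only):

```python
class PrivacyAccountant:
    """Hypothetical sketch of a simple additive privacy budget.

    Each query spends part of the total epsilon; once the budget is
    exhausted, further queries are refused.
    """

    def __init__(self, total_epsilon: float):
        self.remaining = total_epsilon

    def spend(self, epsilon: float) -> None:
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted: query refused")
        self.remaining -= epsilon


accountant = PrivacyAccountant(total_epsilon=1.0)
accountant.spend(0.4)  # first query: allowed
accountant.spend(0.4)  # second query: allowed
try:
    accountant.spend(0.4)  # would exceed the budget
except RuntimeError as err:
    print(err)  # privacy budget exhausted: query refused
```

Real systems use tighter accounting than straight addition, but the blocking behaviour once the threshold is crossed is the same.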
Another aspect of differential privacy is deciding how much noise to add. The noise should maintain privacy but, at the same time, should not render the database unusable. It is up to the database curator to decide the appropriate amount of noise.
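This trade-off is concrete in the Laplace mechanism, a standard way to add noise (used here as an illustrative example; the talk did not specify a mechanism). For a counting query, whose answer changes by at most 1 when one person is added or removed, noise drawn from Laplace(0, 1/epsilon) suffices: a smaller epsilon means stronger privacy but noisier answers.

```python
import random


def private_count(true_count: int, epsilon: float) -> float:
    """Laplace mechanism for a counting query.

    A count has sensitivity 1, so the noise scale is 1 / epsilon.
    The difference of two independent exponential samples is a
    Laplace sample.
    """
    scale = 1.0 / epsilon
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise


random.seed(42)
true_count = 1_000
print(private_count(true_count, epsilon=0.1))   # noisy, strong privacy
print(private_count(true_count, epsilon=10.0))  # near the truth, weak privacy
```

The noise averages out to zero, so aggregate statistics stay useful even as individual query answers are blurred.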
Differential privacy has existed for a long time but has been gaining traction of late. Interestingly, the 2020 US Census data was released with differential privacy applied. Tech giants like Apple, too, have implemented local differential privacy in their systems.