“Data is the new oil” is a saying of the past. “Data is my property, and I should get paid for it” might be the catchphrase of this decade.
All kinds of user data are being exploited, manipulated and monetised by web-based marketplace services powered by algorithms. People unknowingly give vast amounts of their data to companies every day, and that data is used to generate massive profits.
But how valuable is any data?
And how do we put a price tag on data in an unbiased way?
Any answer has to deal with two concerns:
- Fairness: Guard against adversarial attacks and data poisoning, where one entity benefits more than others.
- Efficiency: Retain performance while scaling up.
To formulate a metric that would help price data, researchers at Berkeley AI Research incorporate an equivalent of the Shapley value into their framework.
A Shapley Value For Data Pricing
The Shapley value is widely used in profit allocation schemes. It attaches a real-valued number to each player in a game to indicate the relative importance of that player's contribution. Similarly, the researchers treat each data provider as a player with a Shapley value, which is calculated from the relevance of the data they contribute.
For example, a data provider with a skin cancer dataset is more relevant to medical diagnosis than one with a road sign dataset designed for self-driving cars, so a relevance-based measure would value the former more highly for that task.
Shapley value is given by the following expression:
Here, U(S) is the utility function that evaluates the worth of the player(data source) subset S.
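For reference, the standard textbook form of the Shapley value can be written as follows. The notation is an assumption made here for illustration: I is the set of all N players (data sources), i is one player, s_i is that player's value, and U(S) is the utility function described above.

```latex
s_i = \frac{1}{N} \sum_{S \subseteq I \setminus \{i\}} \frac{1}{\binom{N-1}{|S|}} \bigl[ U(S \cup \{i\}) - U(S) \bigr]
```

In words, the value of player i is its marginal contribution U(S ∪ {i}) − U(S), averaged over every subset S of the other players, with each subset size given equal total weight.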
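The Shapley value was also chosen for the following properties, which suit the commercial use of algorithmic solutions:
- Fully distributed: the value of the whole dataset is shared out completely among the contributors
- Ensures fairness: contributors who add the same value receive the same payment, and those who add nothing receive nothing
- Brings in additivity: values computed under several utility functions can simply be added together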
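These properties can be checked on a toy game. The sketch below is a brute-force illustration, not the researchers' implementation: the three providers A, B and C, their sample sets and the coverage-style utility are all hypothetical, chosen only so the numbers are easy to verify by hand.

```python
from itertools import combinations
from math import comb


def shapley_values(players, utility):
    """Exact Shapley values by enumerating every subset of the other players."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Each subset size carries total weight 1/n,
            # split evenly over the comb(n - 1, size) subsets of that size.
            weight = 1.0 / (n * comb(n - 1, size))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of p when it joins subset s.
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


# Hypothetical providers and the useful samples each one holds.
SAMPLES = {"A": {1, 2, 3}, "B": {3, 4}, "C": set()}


def coverage(subset):
    """Toy utility U(S): number of distinct useful samples held by S."""
    covered = set()
    for p in subset:
        covered |= SAMPLES[p]
    return len(covered)


values = shapley_values(list(SAMPLES), coverage)
print(values)
```

In this toy game A receives about 2.5, B about 1.5, and C, which contributes nothing, receives 0. The three values add up to 4, the utility of all providers together. The inner loops visit every subset of the other players, so the work doubles with each extra player.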
However, calculating the Shapley value exactly is challenging even for a small dataset. It needs many utility evaluations, and since each evaluation here means re-training a model, the cost grows quickly with the size of the dataset. To address the issue, the team at Berkeley AI Research uses KNN classification, which removes the need for re-training.
As shown in the figure above, the team demonstrated how the computational requirements of the Shapley value could be significantly reduced for KNN.
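To give a feel for why KNN makes this tractable, here is a minimal sketch of a recursion commonly cited for unweighted KNN classification with a single test point. It is an illustration under stated assumptions, not the team's code: the function name, the Euclidean distance, and the utility (the fraction of the K nearest neighbours whose label matches the test label) are choices made here, and the result should be checked against the original work before any real use.

```python
import math


def knn_shapley(train_points, train_labels, test_point, test_label, k):
    """Shapley value of each training point for one test point (unweighted KNN).

    Assumed utility: U(S) = (1/k) * number of the min(k, |S|) nearest
    points in S whose label equals test_label.
    """
    n = len(train_points)
    # Rank training points from nearest to farthest from the test point.
    order = sorted(range(n), key=lambda i: math.dist(train_points[i], test_point))
    match = [1.0 if train_labels[i] == test_label else 0.0 for i in order]

    ranked = [0.0] * n
    # The farthest point only matters when few other points are present.
    ranked[n - 1] = match[n - 1] / n
    # Walk inwards: each value differs from the next-farther one by a
    # term that depends only on the two labels and the rank.
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank of this point
        ranked[pos] = ranked[pos + 1] + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank

    # Map the values back to the original training order.
    values = [0.0] * n
    for pos, i in enumerate(order):
        values[i] = ranked[pos]
    return values


# Hypothetical one-dimensional data: three training points, one test point.
points = [(1.0,), (2.0,), (3.0,)]
labels = ["a", "b", "a"]
values = knn_shapley(points, labels, (0.0,), "a", k=1)
print(values)
```

With these toy inputs the nearest point gets about 5/6, the wrongly labelled middle point about −1/6, and the farthest point 1/3, which together sum to the utility of the full training set (1). One sort and one pass over the data replace the enumeration of subsets, and no model is re-trained.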
To address the scalability challenge in the online setting, the team has also developed an approximation algorithm to compute the Shapley value for KNN (K Nearest Neighbors) with improved efficiency.
One of the crucial challenges of introducing any metric into a machine learning setting is withstanding adversarial attacks. When the Shapley value was computed for a dataset injected with noise, it showed promising results in detecting the noisy training data, which builds confidence in the adversarial robustness of the model.
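The idea of spotting noisy data through low values can be illustrated on a toy example. The sketch below is self-contained and hypothetical: it restates a KNN Shapley recursion for one test point (assuming an unweighted KNN utility and Euclidean distance), averages the values over two made-up test points, and flags the training point with the lowest average. The data, function names and the "lowest value is the suspect" rule are assumptions for illustration, not the researchers' experiment.

```python
import math


def knn_shapley_one_test(train_x, train_y, test_x, test_y, k):
    """KNN Shapley values of all training points for a single test point."""
    n = len(train_x)
    order = sorted(range(n), key=lambda i: math.dist(train_x[i], test_x))
    match = [float(train_y[i] == test_y) for i in order]
    vals = [0.0] * n
    vals[order[-1]] = match[-1] / n
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank by distance
        vals[order[pos]] = (vals[order[pos + 1]]
                            + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank)
    return vals


def mean_values(train_x, train_y, tests, k):
    """Average each training point's value over all test points."""
    totals = [0.0] * len(train_x)
    for test_x, test_y in tests:
        per_test = knn_shapley_one_test(train_x, train_y, test_x, test_y, k)
        for i, v in enumerate(per_test):
            totals[i] += v
    return [t / len(tests) for t in totals]


# Two clusters: class 0 near 0, class 1 near 10.
# Index 2 sits in the class-0 cluster but is (wrongly) labelled 1.
train_x = [(0.0,), (1.0,), (0.5,), (10.0,), (11.0,)]
train_y = [0, 0, 1, 1, 1]
tests = [((0.4,), 0), ((10.4,), 1)]

scores = mean_values(train_x, train_y, tests, k=1)
suspect = min(range(len(scores)), key=scores.__getitem__)
print(suspect, scores)
```

Here the mislabelled point (index 2) is the only one with a negative average value, so it is the one flagged. On real data a single lowest score proves little; a ranked list of low-value points for manual review is the more cautious reading.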
Beyond putting a price tag on data, the researchers believe that the Shapley value can also help improve the interpretability of AI models.
Dawn Of Data Dignity
Jaron Lanier, a tech pioneer, was featured in a recent New York Times op-ed explaining why people should get paid for their data. Under such a scheme, he estimates that the data of a four-person household could fetch around $20,000.
Lanier insists that there is more to it than a mere monetary benefit. He calls it “data dignity.” Since the data exists because an individual exists, they should have the final say on what happens to their data, while also being allowed to make money from the data they choose to provide.
If an e-commerce company skims through your buying history to recommend a product, that’s fair. But what if the same information is used to recommend similar products to other customers based on similarity scores?
For instance, suppose an old customer had bought a very expensive item X followed by a low-priced item Y. If a new customer on the website buys the same low-priced item Y, they might be recommended X, which they never had plans to buy in the first place.
This looks like a typical recommendation engine use case.
Now, what if this new customer goes ahead and buys this X?
Then the e-commerce site has just hit the jackpot. And this is not an isolated case: such customer-retention strategies play out every second across the globe, minting billions of dollars. Now imagine how many other industries are directly and indirectly benefiting from users’ data. From increasing website traffic to personalising ads that make millions to insurance spam, data is being squeezed for profits every second without the knowledge of its owner.
As automated data-driven solutions will rule the market in the coming decades, there is an immediate need to devise a robust framework, powered by metrics such as the Shapley value, that will restore the data dignity of individuals through transparent yet financially lucrative subscription plans.