How To Eliminate Unrecognised Malignancy In Machine Learning Models Using Black-box Attacks

Abhishek Sharma

Machine learning (ML) has been tried and tested by almost everyone, leading to opportunities for innovation. It can also be a breeding ground for attackers who exploit loopholes in models or algorithms and so compromise the functioning of the ML system itself. Many researchers have proposed techniques to counter and limit this kind of malicious activity. This article explores a case study in which researchers at Pennsylvania State University, along with experts from OpenAI, the US Army Research Laboratory and the University of Wisconsin, demonstrate a practical black-box attack strategy in order to expose such weaknesses.

What is a black-box attack?

In computing terms, a black box is a system that sits between an input and an output: you can observe what comes out for a given input without knowing how the box works inside. In ML, the black box is generally the learning model or algorithm itself. In a black-box attack, the attacker can only submit inputs to the target (a deep neural network, in this case) and observe its outputs, without any knowledge of internals such as the network layers, the ML algorithm, the network architecture or even the size of the network. To study this threat, the researchers devised their own black-box attack, which itself uses ML techniques, and tested it on platforms such as Amazon and Google.
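
The black-box setting can be illustrated with a minimal Python sketch. Everything here is hypothetical and chosen only for illustration: `make_oracle`, the weights and the bias stand in for a remote classifier, and are not taken from the study. The point is that the caller can obtain labels but has no way to read the parameters.

```python
from typing import Callable, List, Sequence


def make_oracle(weights: Sequence[float], bias: float) -> Callable[[Sequence[float]], int]:
    """Build a hypothetical remote classifier that exposes labels only."""

    def oracle(x: Sequence[float]) -> int:
        # The parameters live in this closure; the caller never sees them.
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        return 1 if score > 0 else 0

    return oracle


# Assumed toy parameters, known to the "service" but not to the attacker.
oracle = make_oracle([2.0, -1.0], 0.5)

# All the attacker can do: choose inputs and record the returned labels.
queries: List[List[float]] = [[0.0, 0.0], [1.0, 3.0], [-1.0, 0.0]]
labels = [oracle(x) for x in queries]
print(labels)  # [1, 0, 0]
```

The attack described in the rest of the article relies on nothing beyond this kind of label-only access.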

The counter-attack tactics

The strategy followed by the researchers draws on adversarial machine learning, a field that focuses on the security aspects of ML. The attackers are called ‘adversaries’ because they attack classifier models. Adversaries target these models because they can manipulate the input data so that the model classifies it wrongly, which can have devastating consequences. To demonstrate this, the researchers train a substitute model on a synthetic dataset whose labels are obtained by querying the target deep neural network (DNN), which acts as an oracle. This means the attacker never needs access to the real training data or to the target's internals.

The strategy also relies on a property called ‘transferability between architectures’: adversarial samples crafted against one model are often misclassified by another model as well, even when the two have different architectures. In the words of the researchers, the strategy is as follows:

  1. Substitute Model Training: the attacker queries the oracle with synthetic inputs selected by a Jacobian based heuristic to build a model F approximating the oracle model O’s decision boundaries.
  2. Adversarial Sample Crafting: the attacker uses substitute network F to craft adversarial samples, which are then misclassified by oracle O due to the transferability of adversarial samples.

The first step is a long, detailed process. It involves choosing a substitute architecture suited to the classification task and limiting the number of queries made to the oracle so that the approach stays tractable. A complex architecture such as a DNN is preferred. The attacker does not need a large dataset to begin with: a small initial set of inputs is labelled by the oracle and used to train the substitute, and the Jacobian-based heuristic then generates new synthetic inputs from it. Repeating this cycle of labelling, training and augmenting yields a larger synthetic dataset and a substitute that approximates the oracle's decision boundaries more closely, all without any knowledge of what happens inside the oracle.
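
The label, train and augment cycle can be sketched in plain Python. This is a toy illustration under loud assumptions, not the researchers' implementation: the oracle is a made-up linear rule, the substitute is a two-feature logistic regression rather than a DNN, and the names `train_substitute` and `jacobian_augment` as well as the step size `lam` are hypothetical. Each round adds the points x + lam * sign(dF_y/dx), where y is the oracle's label for x.

```python
import math


def oracle(x):
    """Hypothetical black box: the attacker sees only the returned label."""
    return 1 if 2.0 * x[0] - 1.0 * x[1] + 0.5 > 0 else 0


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_substitute(data, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression substitute F with plain SGD."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def sign(v):
    return (v > 0) - (v < 0)


def jacobian_augment(data, labels, w, lam=0.1):
    """Return new points x + lam * sign(dF_y/dx) for each labelled x.

    For logistic regression, dF_1/dx is proportional to w and dF_0/dx to -w.
    """
    new_points = []
    for x, y in zip(data, labels):
        direction = 1 if y == 1 else -1
        new_points.append([xi + lam * direction * sign(wi) for xi, wi in zip(x, w)])
    return new_points


# Small initial set; every label below costs one oracle query.
data = [[0.5, 0.0], [-0.5, 0.0], [0.0, 1.0], [0.0, -1.0]]
labels = [oracle(x) for x in data]

for _ in range(3):  # three augmentation rounds: 4 -> 8 -> 16 -> 32 points
    w, b = train_substitute(data, labels)
    new_points = jacobian_augment(data, labels, w)
    labels += [oracle(x) for x in new_points]  # only new points are queried
    data += new_points

w, b = train_substitute(data, labels)  # final substitute


def substitute_predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
```

In this sketch the dataset doubles each round, so the total number of oracle queries equals the final dataset size. That is why capping the number of rounds keeps the query budget under control.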

The second step involves creating the ‘adversarial samples’ mentioned above using two popular algorithms: (i) the Goodfellow et al. algorithm and (ii) the Papernot et al. algorithm. Both make the misclassification goal easier to achieve. The former perturbs the input along the sign of the cost function gradient, while the latter uses saliency values to choose which input components to perturb, which generally gives smaller perturbations at a higher computing cost.
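
The gradient-sign idea behind the Goodfellow et al. algorithm (commonly known as the fast gradient sign method) can be sketched as follows. The substitute here is an assumed logistic regression with made-up parameters `W` and `B`, and the step size `eps` is likewise an illustrative choice. A real attack would use the trained substitute DNN and its gradients instead.

```python
import math

# Hypothetical parameters of an already trained substitute model F.
W, B = [2.0, -1.0], 0.5


def prob_class1(x):
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))


def predict(x):
    return 1 if prob_class1(x) > 0.5 else 0


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(x, y, eps):
    """Return x + eps * sign(gradient of the cross-entropy cost w.r.t. x).

    For logistic regression that gradient is (p - y) * W.
    """
    p = prob_class1(x)
    grad = [(p - y) * w for w in W]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


x = [0.2, 0.3]
y = predict(x)               # the substitute's label for the clean input
x_adv = fgsm(x, y, eps=0.4)  # perturbed input, approximately [-0.2, 0.7]

print(y, predict(x_adv))     # 1 0
```

Each input component moves by exactly `eps`, so the perturbation size is controlled by a single parameter. The saliency-based approach instead changes only a few selected components.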

Attack validation

Once the strategy is developed, its feasibility is validated against a remote DNN hosted by MetaMind Inc. The oracles considered in the study are trained on the MNIST and GTSRB datasets. The results show that 84% and 64% of the adversarial inputs, respectively, are misclassified by the oracle. In addition, the strategy was deployed against classifiers hosted by Amazon and Google.
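
A misclassification rate of this kind is simply the fraction of adversarial inputs that the oracle labels incorrectly. The sketch below shows the calculation on made-up values: the oracle, the four inputs and their true labels are hypothetical and have nothing to do with the figures reported in the study.

```python
def oracle(x):
    """Hypothetical black box: returns a label only."""
    return 1 if 2.0 * x[0] - 1.0 * x[1] + 0.5 > 0 else 0


def misclassification_rate(classifier, samples, true_labels):
    """Fraction of samples whose predicted label differs from the true one."""
    wrong = sum(1 for x, y in zip(samples, true_labels) if classifier(x) != y)
    return wrong / len(samples)


# Toy adversarial inputs with the labels of the clean inputs they came from.
adversarial_inputs = [[-0.2, 0.7], [0.5, 0.0], [-1.0, 0.0], [0.0, 1.0]]
true_labels = [1, 1, 0, 0]

rate = misclassification_rate(oracle, adversarial_inputs, true_labels)
print(rate)  # 0.25
```

Because the samples are crafted on the substitute but scored on the oracle, this number measures how well they transfer from one model to the other.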


This study was mainly done to identify loopholes in ML models. The researchers report their results from an attacker's perspective, in the form of an attack strategy. This shows that ML can be manipulated too. Almost every day, ML and data science produce better algorithms and working models. Their vulnerabilities should be addressed alongside their benefits.

Copyright Analytics India Magazine Pvt Ltd