Explained: MIT Scientists’ New Reinforcement Learning Approach To Tackle Adversarial Attacks

Adversarial inputs, sometimes described as machine learning’s optical illusions, are inputs that an attacker has intentionally designed to confuse a model into making a mistake. Such inputs are particularly dangerous for systems with a very low margin for error. For instance, an attacker could present a self-driving car with an adversarial stop sign, one altered so that the car misreads it.

In this article, we examine why adversarial inputs are difficult to overcome and how academics at MIT have tackled these obstacles.

The Challenge

Suppose you have to train a model to identify whether a picture you have taken is of a dog or not. In an ideal scenario, a model trained on labelled images of dogs and of things that are not dogs would be enough. But an adversary can introduce, for instance, a morphed image of a cat that the originally trained model cannot distinguish from a dog.

In this case, the ML developer can rely on adversarial training: the model is also trained on altered images that could otherwise fool it, each paired with its correct label. This can make the model more robust to adversarial attacks. However, there are two drawbacks to this approach.

Firstly, the brute-force approach of generating every possible adversarial example is not efficient enough for real-time applications like a self-driving car. Running through every possible image alteration takes too much computation time to be usable in time-sensitive applications.

Secondly, an attacker can often still work out where the model’s decision threshold lies, regardless of how many altered images it was trained on. This is easiest when the classification model outputs probabilities. For instance, the model reports that a particular image is 99% aeroplane and 1% cat. The adversary can then tweak their input, watch how the probabilities shift, and keep the changes that push the model towards the wrong answer.
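To make the probability leak concrete, here is a minimal sketch of a score-based attack. Everything in it is an assumption for illustration: the three-feature logistic "model", its weights, and the names `predict_proba` and `score_based_attack` are hypothetical and do not come from the MIT work or from any library. The attacker only calls the model and reads the number it returns.

```python
import math

# Hypothetical stand-in for a deployed classifier. The attacker never reads
# WEIGHTS or BIAS; they only call predict_proba and look at the number returned.
WEIGHTS = [2.0, -1.0, 0.5]
BIAS = -0.2


def predict_proba(x):
    """Probability that x belongs to the positive class (say, 'dog')."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def score_based_attack(x, step=0.05, max_rounds=100):
    """Greedy search: each round, keep the single small change that lowers
    the returned probability the most, until the predicted class flips."""
    x = list(x)
    current = predict_proba(x)
    for _ in range(max_rounds):
        if current < 0.5:
            break  # the model now predicts the negative class
        best_p, best_x = current, None
        for i in range(len(x)):
            for delta in (-step, step):
                candidate = list(x)
                candidate[i] += delta
                p = predict_proba(candidate)
                if p < best_p:
                    best_p, best_x = p, candidate
        if best_x is None:
            break  # no single change helps, so give up
        x, current = best_x, best_p
    return x


original = [1.0, 0.5, 0.2]
adversarial = score_based_attack(original)
print(predict_proba(original), predict_proba(adversarial))
```

The search never inspects the model’s parameters. The returned probability alone tells the attacker which small change moves the input closer to the decision threshold.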

Even if the model outputs only class labels, that is, it only says whether the image is a cat or not, the adversary can still query it to label data for a substitute model of their own. The substitute approximates the original model’s decision boundary, and adversarial inputs crafted against it can then be used against the original.

Such attacks are explored in the paper ‘Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples’.

Overcoming The Challenge

Scientists at MIT came up with an advanced variant of reinforcement learning (RL), a type of machine learning that does not need labelled data. RL is widely used in algorithms that play games like Chess or Go.

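Their approach is called CARRL (Certified Adversarial Robustness for Deep Reinforcement Learning), and it builds on an existing deep reinforcement learning method, the Deep Q-Network (DQN). A DQN is a neural network that takes an input and assigns each available action a Q-value, an estimate of the reward that action is expected to bring.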
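As a rough illustration of what a Q-value is used for, the sketch below shows the two standard pieces of Q-learning: picking the action with the highest Q-value, and computing the target that a Q-value is trained towards. The function names and the numbers are hypothetical, and a real DQN would produce the Q-values with a neural network rather than a hard-coded list.

```python
def greedy_action(q_values):
    """Index of the action with the highest estimated Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Standard Q-learning target: the immediate reward plus the discounted
    value of the best action available in the next state."""
    if done:
        return reward  # no future reward after the episode ends
    return reward + gamma * max(next_q_values)


# Hypothetical Q-values for three Pong actions: up, stay, down.
q_values = [0.2, 0.7, -0.1]
print(greedy_action(q_values))
print(td_target(1.0, [0.5, 2.0], gamma=0.9))
```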

CARRL treats the input image as a single point and then considers a region of possible adversarial influence around that point. It analyses every possible position within this region to find the ‘associated action’ that would result in the most ‘optimal worst-case reward’.
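The idea of choosing the action with the best worst-case reward can be sketched on a toy problem. This is not the researchers’ implementation: it swaps the deep network for a hypothetical linear Q-function, because for a linear function the lowest value inside the region can be written down exactly, whereas a deep network would need a dedicated bounding method. The weights, action names and `epsilon` (the assumed size of the region) are all made up for illustration.

```python
# Hypothetical linear Q-function, Q(s, a) = W[a] . s + B[a], standing in for
# a trained DQN. The single state feature is the observed ball offset.
W = {"up": [4.0], "stay": [0.0], "down": [-4.0]}
B = {"up": 0.0, "stay": 0.5, "down": 0.0}


def q_value(state, action):
    """Q-value of an action at the observed state."""
    return sum(w * s for w, s in zip(W[action], state)) + B[action]


def worst_case_q(state, action, epsilon):
    """Lowest Q-value over all states whose features each lie within epsilon
    of the observation. For a linear Q this is Q - epsilon * (sum of |w|)."""
    return q_value(state, action) - epsilon * sum(abs(w) for w in W[action])


def nominal_action(state):
    """What a plain DQN-style agent does: trust the observation as given."""
    return max(W, key=lambda a: q_value(state, a))


def robust_action(state, epsilon):
    """Pick the action whose worst-case Q-value is highest."""
    return max(W, key=lambda a: worst_case_q(state, a, epsilon))


observed = [0.2]
print(nominal_action(observed))      # up
print(robust_action(observed, 0.1))  # stay
```

With `epsilon` set to 0 the two choices coincide. As `epsilon` grows, the robust agent shifts towards actions whose value depends less on the exact observation, which is the same conservatism the article describes later.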

If there is a possibility of adversarial input, the approach bases its decision on the worst-case reward. The developers tested their approach on the game Pong, in which two players operate paddles on either side of the screen.

When the computer plays fair, a plain DQN is enough to beat it. But the DQN will lose if an adversarial input changes the apparent direction of the ball. With CARRL, the agent instead places the paddle where the worst-case reward is highest.

Going Forward

The MIT researchers also tested the approach on a task with two agents, in which one agent had to reach its destination without colliding with the second. The first agent successfully avoided collisions despite the adversarial attacks introduced by the second.

The scientists then made the second agent increasingly adversarial. Beyond a certain point, the first agent avoided going to its destination altogether. The scientists said this sort of conservatism is useful, since it can serve as a limit when tuning the algorithm’s robustness. The new method could go a long way towards deploying AI where decision-making is time-sensitive and the margin for error is small.

Kashyap Raibagi
Kashyap currently works as a Tech Journalist at Analytics India Magazine (AIM). Reach out at kashyap.raibagi@analyticsindiamag.com
