The Societal Implications of Deep Reinforcement Learning

Machine learning (ML) can handle far more complex tasks than outputting a single decision based on a labelled training dataset. Reinforcement learning (RL), a subset of ML, trains an agent to learn through interaction with its environment, using trial and error to alter its behaviour based on feedback. While a simple RL model can make decisions using a large lookup table, modern RL applications are far too complex for the tabular approach to suffice. Deep learning (DL), another subset of ML, comes in handy here: it uses large matrices of numerical values to produce an output through the repeated application of mathematical operations.
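To make the tabular approach concrete, here is a minimal sketch of Q-learning, one common tabular RL method, on a made-up five-cell corridor. The environment, the function name `train_q_table` and all parameter values are assumptions chosen for illustration, not something taken from the study discussed here.

```python
import random


def train_q_table(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                  epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D corridor.

    States are 0..n_states-1. Action 0 moves left, action 1 moves right.
    Reaching the last state gives a reward of 1 and ends the episode.
    """
    rng = random.Random(seed)
    # The "large table": one row per state, one value per action.
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != n_states - 1:
            # Trial and error: act randomly with probability epsilon
            # (or when the table has no preference yet), else act greedily.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == n_states - 1 else 0.0
            # Feedback: nudge the entry toward reward + discounted future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train_q_table()
# Greedy action per non-terminal state (1 means "move right").
greedy_policy = [0 if left > right else 1 for left, right in q_table[:-1]]
print(greedy_policy)
```

The table here has only ten entries. With image inputs or continuous sensor readings the number of states explodes, which is the gap a neural network fills in DRL by approximating the table instead of storing it.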

A deep reinforcement learning (DRL) model combines RL and DL: RL helps the network learn which actions to take based on inputs and rewards, and DL scales the approach to more complex environments. A DRL system works best when it has a well-defined environment, a clear reward function, plenty of data, and lots of computing power. Most DRL applications do not check all of these boxes, which limits the widespread deployment of DRL.

Despite these limitations, DRL models enjoy a good track record. They are also raising a few concerns, however. According to a study published in the Journal of AI Research, these concerns will grow as DRL makes strides.

Human Oversight

According to the Ethics Guidelines for Trustworthy AI, published by the European Commission, AI systems must allow human oversight. DRL’s complex applications aim to bring more autonomy to machines and could potentially be used in systems that make hundreds of decisions in a short period. This poses a challenge, since it is unclear what human oversight could look like in such a context.

DRL systems that can learn continually after deployment pose another challenge. These systems will learn at a pace that is challenging for humans to keep up with. The question of how humans will be able to maintain oversight on continually learning systems has not been addressed in existing ethics and governance proposals. 

Humans should at least be able to impose constraints when the system is being designed or review decisions after they have been made.
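One way to picture both forms of oversight is a thin wrapper around the learned policy: a human-written rule vetoes disallowed actions before they are executed, and every decision is logged for later review. The sketch below is only illustrative. The class `OverseenAgent`, the toy policy and the obstacle rule are all hypothetical and do not come from the guidelines or from any particular library.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class OverseenAgent:
    """Wraps a policy with a design-time constraint and an audit log."""
    policy: Callable[[Any], str]            # stands in for a learned DRL policy
    is_allowed: Callable[[Any, str], bool]  # constraint written by humans
    fallback_action: str                    # safe default when an action is vetoed
    audit_log: List[dict] = field(default_factory=list)

    def act(self, observation):
        proposed = self.policy(observation)
        allowed = self.is_allowed(observation, proposed)
        chosen = proposed if allowed else self.fallback_action
        # Record every decision so that humans can review it afterwards.
        self.audit_log.append({
            "observation": observation,
            "proposed": proposed,
            "chosen": chosen,
            "overridden": not allowed,
        })
        return chosen


def eager_policy(observation):
    # Toy policy that always wants to speed up.
    return "accelerate"


def no_acceleration_near_obstacles(observation, action):
    # Hypothetical rule: never accelerate within 5 metres of an obstacle.
    return not (action == "accelerate" and observation["obstacle_distance"] < 5.0)


agent = OverseenAgent(eager_policy, no_acceleration_near_obstacles, "brake")
far = agent.act({"obstacle_distance": 20.0})
near = agent.act({"obstacle_distance": 1.0})
overrides = [entry for entry in agent.audit_log if entry["overridden"]]
```

The constraint runs on every decision without a human in the loop, so it scales to systems that make hundreds of decisions in a short period, while the log supports review after the fact. Neither mechanism addresses a policy that keeps changing after deployment.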

Safety and Reliability

RL is the core of DRL and is trained by trial and error, which means an application learns from its mistakes. In some scenarios, mistakes are unaffordable. For instance, a cobot (collaborative robot) working alongside humans that moves its arm in the wrong direction could have fatal repercussions.

Engineers need to develop new methods, such as training algorithms in simulation, to make the models robust and avoid harm. However, this will not be useful for DRL systems that continue to learn from data after deployment. Continual monitoring of such systems, along with continuous testing, evaluation, and validation, is critical. Explainable AI models that are inherently easier to understand could help in these situations.

Harms From Reward Function Design

RL is based on rewards, and a poorly specified reward function in a DRL system can lead to unintended behaviour. For example, an RL agent was trained to play a boat race game. The boat was meant to finish the race, but it ended up knocking into its opponents to gain points and never actually finished.
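The mismatch between the reward that was specified and the outcome that was intended can be shown with a toy simulation. Everything in this sketch is hypothetical (the point values, the action names and the `run_episode` function); it is not a reconstruction of the actual game, only an illustration of why a point-maximising agent may never cross the finish line.

```python
def run_episode(policy, steps=20, finish_position=5):
    """Simulate a hypothetical race and return (points, finished)."""
    position, points = 0, 0
    for _ in range(steps):
        action = policy(position)
        if action == "advance":
            position += 1
            if position == finish_position:
                points += 10  # one-off bonus for finishing the race
                return points, True
        elif action == "bump_opponent":
            points += 3       # repeatable reward that needs no progress
    return points, False


def racer(position):
    # Behaviour the designers intended: head for the finish line.
    return "advance"


def point_farmer(position):
    # Behaviour the reward actually favours: farm points forever.
    return "bump_opponent"


racer_points, racer_finished = run_episode(racer)
farmer_points, farmer_finished = run_episode(point_farmer)
print(racer_points, racer_finished)    # 10 True
print(farmer_points, farmer_finished)  # 60 False
```

An agent that only sees the points has every reason to prefer the second policy, because under these assumed values finishing the race is worth less than standing still and collecting the repeatable reward.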

Sometimes, even when the behaviour is as intended, there can be broader consequences. For instance, social media content-selection algorithms are optimised to show content a user is most likely to click. But the ubiquitous use of these algorithms has instead ended up changing users’ preferences. The algorithms start to traffic in extreme, polarising content, leading to filter bubbles, echo chambers, resurgences of fascism and crumbling economies.

Such consequences are a result of companies focusing on specific objectives rather than individual or societal interests. To address this issue, DRL systems will require a broader sense of what responsible and ethical reward functions should look like.

Incentives For Data Collection

DRL is dependent on huge amounts of data. Up-to-date and expansive data collection of people and societies presents several concerns regarding data privacy violations or mass surveillance. 

Take DRL systems used in smart cities, for instance. As cities get digitised, effective DRL systems could help with better resource management, but they could also intrude on people’s privacy by collecting sensitive data. Lack of transparency in data use can further lead to a lack of accountability, and the collected data could be used for purposes other than those it was intended for. India’s Aadhaar card is a good case in point. Also, the collection of personal data could disproportionately affect vulnerable communities.

Security And Potential Misuse

DRL systems will live up to their potential when deployed in real-life, time-sensitive applications like self-driving cars. In such cases, these systems will have to be robust, not only to various situations they could face in the real world but also to adversarial inputs, like a fake traffic sign. Research has shown that it is harder to distinguish adversarial attacks in DRLs than other ML approaches since the training data is continuously changing. 
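To give a feel for what an adversarial input is, the sketch below uses a toy linear "stop sign detector" rather than a DRL policy; the weights, the feature values and the perturbation size are made-up assumptions. It shifts each feature by a small, bounded amount in the direction that lowers the detector's score, which is the basic idea behind gradient-sign attacks on differentiable models.

```python
def score(weights, features):
    """Linear detector: a positive score means 'stop sign detected'."""
    return sum(w * x for w, x in zip(weights, features))


def sign(value):
    return (value > 0) - (value < 0)


def adversarial_example(weights, features, epsilon):
    """Shift each feature by at most epsilon in the direction that lowers the score."""
    return [x - epsilon * sign(w) for w, x in zip(weights, features)]


weights = [0.8, -0.5, 0.3]  # hypothetical learned weights
clean = [0.6, 0.2, 0.5]     # hypothetical features of a real stop sign
perturbed = adversarial_example(weights, clean, epsilon=0.4)

clean_is_stop = score(weights, clean) > 0
perturbed_is_stop = score(weights, perturbed) > 0
largest_change = max(abs(p - c) for p, c in zip(perturbed, clean))
```

No single feature moves by more than `epsilon`, yet the decision flips from "stop sign" to "no stop sign". Attacks on deep models follow the same pattern, using the gradient of the network in place of the fixed weights.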

DRL systems also risk potential misuse. For instance, they could be used for spreading disinformation. DRL algorithms can enable this with more accuracy and speed.

Automation And Future Of Work

Several studies suggest low-wage or manual jobs are unlikely to get automated because of the skill and mobility needed for such jobs. However, with advances in DRL and related technologies like robotics, SLAM or embedded vision systems, the automation of these jobs could be accelerated. On the flip side, DRL systems could also reduce human labour. They could, for example, automate tedious and dangerous aspects of manual labour.

Kashyap Raibagi
Kashyap currently works as a Tech Journalist at Analytics India Magazine (AIM). Reach out at
