
A Guide to Deep Abstract Q-Network for Reinforcement Learning

The Deep Abstract Q-Network can be seen as an extension of traditional deep Q-learning that, to an extent, enables a reinforcement learning agent to be trained in high-dimensional domains.


The traditional deep Q-learning approach struggles to learn in high-dimensional domains with sparse rewards and long horizons. Many domains have exactly these properties, which makes them hard to handle with conventional methods. Deep Abstract Q-learning, an advanced reinforcement learning technique, can improve performance in such settings to an extent. In this article, we will introduce the Deep Abstract Q-Network, how it works, and where it has been applied. The major points to be discussed in this article are listed below.

Table of Contents

  1. What is a Deep Abstract Q-Network?
  2. How Does It Work?
  3. Applications of the Deep Abstract Q-Network

What is a Deep Abstract Q-Network?

The Deep Abstract Q-Network can be considered an advancement of traditional deep Q-learning that, to an extent, enables a reinforcement learning agent to be trained in high-dimensional domains. In such domains, the agent often receives only sparse reward signals, and this sparsity makes it difficult for the agent to master games that lie beyond the reach of standard methods. In many games, an agent must navigate a complex or unknown environment and manipulate certain objects to achieve long-term goals.

Let’s understand this by taking an example of a long-horizon problem. Consider a high-dimensional domain in which an agent must navigate a set of cluttered rooms using only visual input. One of the rooms is locked, and the agent can obtain the key at a known location in the domain, possibly in a different, unlocked room. In this situation the agent has to perform two main tasks: navigate through several rooms to find the key, and then retrace its steps to the locked room to unlock it.

So we can understand that learning a policy to traverse multiple rooms becomes hard in such a situation. At present, there is no complete solution for this kind of long-term planning for agents, but the existing approaches can broadly be classified into two categories:

  • Novelty-based approach: the agent is motivated to explore portions of the state space that exhibit some form of novelty.
  • Abstraction-based approach: the gain comes from some kind of abstraction that divides a large problem into sub-problems.

One can think of the Deep Abstract Q-Network as a Q-network built on the abstraction-based approach. However, both of the above approaches have limitations and drawbacks: the novelty-based approach mainly encourages exploration, while the abstraction-based approach typically relies on end-to-end learning of both the abstracted policy and its sub-policies.

In the Deep Abstract Q-Network, long-horizon domains and sparse rewards are managed through a lightweight abstraction that presents the agent with a factorized high-level state. The Markov Decision Process (MDP) of the domain is divided into a symbolic representation used for long-term planning and a pixel-based low-level representation that lets the network exploit the strengths of deep learning.

Applying this to the example above, the symbolic representation can be the current state of the agent, such as which room it is in and whether it holds a key, while the low-level representation can be the pixel values of the image. Factoring converts this collection of symbolic states into state attributes defined by predicate functions, similar to object-oriented MDPs. This factoring is what improves the performance of the Q-network in high-dimensional domains.
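As a small illustration of such factoring, here is a minimal Python sketch of a hypothetical abstraction function (called phi here) that maps a ground state to a factored symbolic state with a room attribute and a has_key attribute. The names and the toy dictionary-based ground state are assumptions for illustration only; in practice the predicates would be computed from the pixel observation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AbstractState:
    """Factored symbolic state: each attribute is produced by a predicate function."""
    room: int       # which room the agent currently occupies
    has_key: bool   # whether the agent is holding a key

def phi(ground_state: dict) -> AbstractState:
    """Abstraction function mapping a ground state to its factored symbolic state.
    The dict-based ground state is a stand-in for a pixel frame in this sketch."""
    return AbstractState(
        room=int(ground_state["room"]),
        has_key=bool(ground_state["keys_held"] > 0),
    )

# Because the representation is factored, picking up a key changes only `has_key`;
# the `room` attribute (and any navigation behaviour tied to it) is unaffected.
s = phi({"room": 3, "keys_held": 0})
s_with_key = phi({"room": 3, "keys_held": 1})
assert s.room == s_with_key.room and s.has_key != s_with_key.has_key
```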

For example, in the scenario above the agent does not need to re-learn how to navigate from one room to another after finding the key; holding the key should not change the way the agent navigates. Let’s now discuss the model processing behind the Deep Abstract Q-Network.

How Does It Work?

As discussed above, the main goal is to make the agent learn long-term plans. Coupling two agents within the abstracted-MDP framework allows additional levels of abstraction. The number of levels normally depends on the dimensionality of the domain, but two levels of abstraction are usually sufficient for a wide variety of domains.

For instance, let’s say there are two agents:

  •  High-level – L1-agent
  •  Low-level – L0-agent

With two levels of agents, the L0 agent operates on states received directly from the environment, while the L1 agent learns from the abstraction, which is provided by the experimenter. Such an abstraction holds very little information about the environment, meaning the L1 agent learns from limited information; keeping the abstraction this light also keeps the engineering effort required of the experimenter to a minimum.

Based on the above points, the model processing can be described in two steps:

  • Abstracted states and actions: By providing the high-level L1 agent with a lower-dimensional abstraction of the ground-level states, we allow the agent to plan at a higher level. This is very similar to object-oriented MDPs, and the factorization prevents the state-action space from growing combinatorially in the number of objects.
  • Interactions between the L1 and L0 agents: By defining the L0 reward function in terms of L1 abstract states, we allow the L0 agent to drive transitions between L1 abstract states. Since the two agents operate in different state spaces, defining a reward and terminal function for each agent is non-trivial (a minimal sketch of this interaction follows the list).
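To make these two steps concrete, below is a minimal Python sketch of how the two agents could interact, under a few assumptions: a Gym-style environment, an L1 agent modelled as a simple tabular Q-learner over abstract states (the abstract space is small by design), and an L0 agent that is any ground-level learner such as a DQN. The helpers phi, l0_policy_step, update_l0 and env.abstract_neighbours are hypothetical names used only for illustration; this is not the authors’ implementation.

```python
import random
from collections import defaultdict

def run_l1_episode(env, phi, l0_policy_step, update_l0,
                   q1=None, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One episode of the two-tiered loop (illustrative sketch, not the paper's code).

    q1                         -- tabular L1 Q-values, indexed as q1[abstract_state][goal]
    l0_policy_step(s, goal)    -- hypothetical L0 action selection (e.g. a goal-conditioned DQN)
    update_l0(s, a, r0, s2, d) -- hypothetical L0 learning update
    """
    q1 = q1 if q1 is not None else defaultdict(lambda: defaultdict(float))
    s = env.reset()
    s_abs = phi(s)
    done = False

    while not done:
        # L1 agent: epsilon-greedy choice of a neighbouring abstract state to try to reach.
        goals = list(env.abstract_neighbours(s_abs))   # assumed helper listing reachable abstract states
        goal = (random.choice(goals) if random.random() < epsilon
                else max(goals, key=lambda g: q1[s_abs][g]))

        # L0 agent: act in the ground MDP until the abstract state changes (its terminal condition).
        r1 = 0.0
        while phi(s) == s_abs and not done:
            a = l0_policy_step(s, goal)
            s_next, r_env, done, _ = env.step(a)
            # The L0 reward ignores the environment reward: +1 only when the goal abstract state is reached.
            r0 = 1.0 if phi(s_next) == goal else 0.0
            update_l0(s, a, r0, s_next, done or phi(s_next) != s_abs)
            r1 += r_env            # the L1 agent keeps the accumulated ground-environment reward
            s = s_next

        # L1 agent: tabular Q-learning update over abstract states.
        s_abs_next = phi(s)
        best_next = max(q1[s_abs_next].values(), default=0.0)
        q1[s_abs][goal] += alpha * (r1 + gamma * best_next - q1[s_abs][goal])
        s_abs = s_abs_next

    return q1
```

The inner loop mirrors the interaction described above: the L0 sub-policy terminates as soon as the abstract state changes, and the L1 agent is credited with the ground-environment reward accumulated along the way.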

Suppose the high-level agent is in an initial abstract state and takes an action from its action set; the resulting state is the outcome of applying that action in the initial state. Each such high-level action therefore triggers the execution of an L0 policy. The modified terminal set and reward function can be written as follows:
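The original post shows these definitions as an image. Reconstructing them from the surrounding description (the exact notation may differ), the L0 agent executing the abstract transition from s̃ to s̃′ could use a terminal set and reward of the form:

$$
T_{\tilde{s}\rightarrow\tilde{s}'} = \{\, s \in S : \phi(s) \neq \tilde{s} \,\}, \qquad
R_{\tilde{s}\rightarrow\tilde{s}'}(s) =
\begin{cases}
1 & \text{if } \phi(s) = \tilde{s}' \\
0 & \text{otherwise,}
\end{cases}
$$

where φ is the abstraction function, s̃ is the abstract state the L0 agent starts in, and s̃′ is the abstract state it is asked to reach: the low-level episode ends as soon as the abstract state changes, and the reward depends only on whether the intended abstract state was reached.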

One thing to notice here is that the low-level agent reward function ignores the ground environment reward function. 

So far, we have seen how the framework and its reward and terminal functions fit together in the Deep Abstract Q-Network. Let’s now look at some works where this approach is used.

Applications of the Deep Abstract Q-Network

In the earlier sections, we discussed what a Deep Abstract Q-Network is and how it works. These networks follow the abstraction-based approach, where abstracting long-horizon policies into sub-policies allows agents to learn and plan more efficiently in long-horizon domains. In this section, we will see some examples of work related to this approach.

  • MAXQ: This work decomposes a flat MDP into a hierarchy of subtasks, each with its own subgoal. The policy needed to complete a subtask is easier to learn than the policy needed to complete the entire task.
  • Option-Critic architecture: This architecture is based on temporally extended actions, which also help with long-horizon problems. It bundles reusable sub-plans into single actions that can be used alongside the primitive environment actions. Such approaches work well on many Atari 2600 games but are less successful on Montezuma’s Revenge and Venture, which are long-horizon domains.
  • Hierarchical-DQN (h-DQN): This is a two-tiered deep Q-learning agent; because it is divided into a low-level controller and a high-level meta-controller, it also falls under this approach.
  • Deep Abstract Q-Networks: This is likewise a two-tiered deep Q-learning agent, but its learning procedure sets it apart from the other works. Because it relies on the classical MDP formulation for learning, the agents used in this work can be applied to a wide range of reinforcement learning problems, and we again see multiple levels of abstraction.

From these examples, we can see that the abstraction-based approach has been used as the basic strategy for dealing with high-dimensional domains and sparse rewards.

Final Words 

In this article, we have discussed what the Deep Abstract Q-Network is and how it works. We have also covered some examples where this type of network and similar variants have been used for different purposes.
