In the recent scenario, reinforcement learning is an evolving field where we can see that various large research groups like OpenAI and DeepMind have performed various formalities and implementations of it. On the other hand, the behaviour trees are originally developed in the gaming industries that are mainly used for performing actions or sets of actions in a managerial way. In this article, we will go through the intuition of behaviour trees and following that we can use this tree in reinforcement learning. The major points to be discussed in this article are listed below.
Table of Contents
- What is a Behavior Tree?
- What is Reinforcement Learning?
- Behaviour Tree In Reinforcement Learning
Let’s start the discussion by understanding the behaviour tree.
What is a Behavior Tree?
In recent years, we have witnessed that performing tasks using artificial intelligence are getting complex. Designing an artificial intelligence agent needs to learn an extended list of actions ahead of time. This list of actions and the way to solve those actions through an agent requires a planning algorithm.
In the planning algorithms of an agent, behaviour trees can be considered as a way to construct, control and structure the action or task-related code. Using the behaviour tree we can effectively manage a large codebase through hierarchies and modularity, and emphasize human readability and reusability. The below image can be considered as a representation of a basic behaviour tree.
In the above example, the purpose of the agent is to pick up the phone, answer the call and cut the call and put down the phone.
As the name suggests, the behaviour tree is a tree structure similar to the other type of tree. Where the root node can be considered as the starting point of the execution of a tree. The tree is usually operated by the query signal generated on the root node. These signals are then spread throughout the tree and the running status of the signal is stopped by the success or failure remark of any node. Running status represents the ongoing process in the node and the final remark represents the corresponding result of the node.
The final remark from the node returns to the parent node by backpropagation where parents can decide to seek another node or to propagate the signal to further child nodes of its child node. By backpropagation, the result of every node reaches the root node and this causes the generation of new query signals. This all works in a cyclic loop, continuously repeating the execution of the tree. The responsive behaviour of the tree to the environment is generated by the cyclic loop which we have discussed.
According to the above working of nodes in the behaviour tree, we can divide the nodes in this tree into three categories:
- Execution Nodes
- Control Flow Nodes
- Decorator Node
Execution Nodes: In the tree, Leaf nodes can be considered as execution nodes that are responsible for acting or defining conditions using the query signal. These nodes can be of two categories:
- Action Nodes: These are the nodes that are responsible for holding the definition of all the low-level behaviour that can be utilized to control any agent. We can say these nodes in the tree are made up of codes of any defined sub-task which a behaviour tree needs to perform. There can be some conditions defined in the action nodes for the execution of any action. Using these conditions an action node can define the success or failure remarks of the process and also represent the running status if the process is taking so long time for execution.
- Condition Node: The condition nodes are nodes that indicate a predicate that is checked when it gets a query signal. The truth-value of the defined predicate is the reason for representing success or failure. Conditions are modelled by considering the observation done by the agent in the environment and the internal state of agent memory. If any change happens in the environment, condition nodes allow the agent to decide the action with the help of control flow nodes.
Control Flow Nodes: These nodes are non-leaf nodes with at least one child node. These nodes are responsible for defining relationships between different nodes. These can also be used to represent sub-branches of the tree. All types of nodes can be the child nodes of control flow nodes. We can categorize control flow nodes in the following categories:
- Sequence: These nodes iterate procedure until a failure or running status doesn’t return from a child node. This means these nodes keep their child nodes iterating until the success status doesn’t appear.
- Fallback: These nodes have the functionality inverse to the sequence nodes. This means iteration works until success, or running status is returned from a child node.
- Parallel: These nodes are responsible for the execution of the child nodes all the time. They return a Success if all children return a Success or a Failure if any child returns a Failure.
Decorator Node: These nodes can be considered as a mix of the execution and control flow nodes. It is mainly responsible for defining policy for the timings when a child node gets the query signal. These nodes can be used for describing the custom logic and these nodes are dependent on the specific implementation. An example of custom logic can be an inversion node where the work of an inversion node is to change the status of a node failure to success or vice-versa.
The below table can be a representation of the summary of all the nodes in the behaviour tree.
As we came to know, the motive behind the behaviour tree is to order the actions for an agent in a sequential way. By taking the above example of answering the phone we can say that the agent will try to pick up the phone, if the agent fails it will try to pick up the phone again and if it is successful it will try to answer and cut the phone after completing all the tasks. The agent will finally put down the phone. One thing is noticeable here if the agent fails in any of the tasks, it doesn’t mean that the process is stopped. An agent will try again and again to perform the task until it succeeds.
In this section, we have a brief introduction to the behaviour tree and understand following the nodes how we make an agent perform any required action. In the next section, we will look at a small introduction to reinforcement learning so that we can generate a better understanding of the motive of this article.
What is Reinforcement Learning?
As briefly explained in this article, reinforcement learning is a section of machine learning where we make a computer learn to perform any task using a through repeated trial and error interaction with an environment that can be dynamic. Going in the deeper direction RL problem is about maximizing
rewards by acting in an unknown environment. In the procedure, an agent performs and chooses the action to be performed in the environment. As a result of performing an action, an agent earns the reward that indicates good it is to transition to a particular state.
Description of combined rewards can be formulated by introducing the Markov reward process and Markov decision process. Let’s start by understanding the Markov process.
Markov Process: if in any process, the end of the process in the next state, depending on only the current state of the process, can be considered as a Markov process. By the formula, a Markov process is:
P(St+1|St) = P(St+1|S1, .., St)
S = state
P = probability
Markov Reward Process: The Markov Reward Process is a process of adding the reward from an agent wherein summarization we can say adding a reward for transitioning to each state and maximizing that reward for the next as well as all future states as in the form of a Markov Process.
Bellman equation can be used to express the addition of reward in the following way:
v(s) =E[Gt|St = s] =E[Rt+1 + γGt+1|St = s] =E[Rt+1 + γv(St+1)|St = s
S = sum of the rewards
Y = discount factor
Markov Decision Process: It is similar to the Markov reward process but here in addition the decision that can be made by an agent regarding any action. Decisions in the RL are dependent on the policy defined by the human. Using the MDP we can state-value function and action-value function as following:
vπ(S) = Eπ(Gt |St = s)
The state value function can be considered as the expected return from the state and acting according to the policy,
qπ(s, a) = Eπ(Gt |St = s, At = a)
The action-value function can be considered as an expected return from the state when taking action and acting according to the policy.
From the above definition, we can define a reinforcement learning setting as a procedure to train an agent to learn using the action space, and talking about the behavior tree we can consider it as an action space. In the next section, we will see how a behavior tree can be used in reinforcement learning.
Behaviour Tree In Reinforcement Learning
In the above section, we have introduced the behavior tree and reinforcement learning. Now if we talk about the behavior tree we can say that different nodes of the tree are responsible for completing any action by an agent and by reinforcement learning we can say that it is the whole procedure to make an agent work on a procedure by learning from the environment while grabbing rewards. Talking about the combination of behavior tree and reinforcement learning we can perform it in the following ways:
- In the categorization of the nodes of the behavior tree, we have discussed the control flow node. This node can be used in the action space of the RL algorithm. Generally, for this purpose, we use the action nodes. Using the action node we can utilize the benefit of managing the readability and modularity of any action from the set or hierarchy of control flow nodes.
- Child nodes of the fallback nodes can also be used as action space in the RL algorithm. With this, we can also reorder the nodes according to the probability of action. A basic idea behind the procedure is to provide a reward to an agent whenever a condition from the original tree is triggered and after this reorder the fallback node according to the Q-values for that particular state. This approach can be considered as a process of reward propagation alongside the behavior tree status.
- After learning the problem, execution nodes can be used to embed the whole RL problem in a single execution node. Because of less reactiveness and less control in the tree, this combination of RL and behavior tree can be applied only to relatively small and specific actions.
- Sequence nodes can be treated as the action space for the behavior tree. These actions can be used to train the RL algorithm and a table with Q-value is generated. After completion of training, a sequence can be assigned to the condition node. Each Condition has a look-up table of states which correspond to a good utility (high reward) for its corresponding action
In this section, we have seen some of the ways where we can combine the behaviour tree in the RL.
In this article, we have discussed the behaviour tree and with an example, we have discussed the working of it with the categories of the nodes in the behaviour tree. Along with that we have discussed a small introduction to reinforcement learning and how behaviour trees can be used in reinforcement learning.