Reinforcement learning is built on the mathematical foundations of the Markov decision process (MDP). Computing an optimal policy is central to reinforcement learning, and dynamic programming is a collection of algorithms for constructing one. Classical dynamic programming algorithms assume a perfect model of the environment and can be computationally expensive, but they remain the theoretical foundation for other reinforcement learning methods. In a finite reinforcement learning environment, we can represent the state, action, and reward sets as St, Ac(st), and R, for st ∈ St, all of which are finite. The dynamics of the environment are given by the set of probabilities p(st′, r | st, ac), for all st ∈ St, ac ∈ Ac(st), r ∈ R, and st′ ∈ St+, where St+ is St plus a terminal state when the task is episodic. Dynamic programming ideas apply to both continuous and discrete state spaces, although exact solutions are generally available only in the discrete case. Dynamic programming searches for good policies by computing value functions; a policy is optimal once its value functions satisfy the following Bellman optimality equations.
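In standard textbook notation, which is an assumption here (s, a, and s′ stand for st, ac, and the successor state, and γ is a discount factor that the text above does not introduce), the Bellman optimality equations are commonly written as:

```latex
\begin{align}
v_*(s)    &= \max_{a} \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_*(s') \,\bigr] \\
q_*(s, a) &= \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma \max_{a'} q_*(s', a') \,\bigr]
\end{align}
```

Here v_* is the optimal state-value function and q_* is the optimal action-value function.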
Policy evaluation, also called the prediction problem, computes the state-value function GP for an arbitrary policy 𝞹.
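As a sketch in standard notation (an assumption: v_π plays the role of the state-value function that the text calls GP, and γ is a discount factor), the value of a policy satisfies the Bellman expectation equation, and iterative policy evaluation turns that equation into an update rule:

```latex
\begin{align}
v_\pi(s)   &= \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr] \\
v_{k+1}(s) &= \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_k(s') \,\bigr]
\end{align}
```

Repeating the second line for every state makes v_k converge to v_π as k grows.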
𝞹(ac|st) 🡪 Probability of taking action ac in state st under policy 𝞹
The state-value function GP is computed in order to find better policies. The policy improvement step, Kimprove, is defined as:
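One common form of this step, given here in standard notation as an assumption about what Kimprove denotes, is the greedy policy with respect to the current value function:

```latex
\begin{equation}
\pi'(s) = \arg\max_{a} \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]
\end{equation}
```

The new policy π′ is at least as good as π in every state, which is what makes repeated improvement worthwhile.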
We can apply policy improvement iteratively, re-evaluating the improved policy Kimprove𝞹 after each step, until the policy no longer improves.
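The loop of evaluation and improvement described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the article's implementation: it uses the standard 4x4 Frozen Lake layout, deterministic (non-slippery) moves, and a discount factor of 0.9, none of which are given in the text. The helper names `step`, `backup`, `evaluate`, and `improve` are hypothetical.

```python
# Assumed model: the standard 4x4 Frozen Lake layout with deterministic
# (non-slippery) moves. Neither the layout nor GAMMA comes from the text.
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # left, down, right, up
N_STATES, N_ACTIONS, GAMMA = 16, 4, 0.9


def step(state, action):
    """Return (next_state, reward); holes and the goal are absorbing."""
    row, col = divmod(state, 4)
    if LAKE[row][col] in "HG":
        return state, 0.0
    row = min(max(row + MOVES[action][0], 0), 3)
    col = min(max(col + MOVES[action][1], 0), 3)
    return row * 4 + col, float(LAKE[row][col] == "G")


def backup(values, state, action):
    """One-step lookahead: r + gamma * v(s')."""
    next_state, reward = step(state, action)
    return reward + GAMMA * values[next_state]


def evaluate(policy, tol=1e-8):
    """Iterative policy evaluation for a deterministic policy."""
    values = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            new_value = backup(values, s, policy[s])
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value
        if delta < tol:
            return values


def improve(values):
    """Greedy policy with respect to the given state values."""
    return [max(range(N_ACTIONS), key=lambda a: backup(values, s, a))
            for s in range(N_STATES)]


def policy_iteration():
    policy = [0] * N_STATES  # start from "always go left"
    while True:
        values = evaluate(policy)
        new_policy = improve(values)
        if new_policy == policy:  # no further improvement: stop
            return policy, values
        policy = new_policy


policy, values = policy_iteration()
print(policy)
```

The loop stops when the greedy policy no longer changes, which matches the stopping condition of iterating until no further improvement is possible.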
Dynamic programming works well on gridworld-like environments, where the state space is small and the model is known. In the Frozen Lake gridworld, the agent controls the movement of a character. Some tiles are walkable, while others are holes through which the character falls into the water of the frozen lake. The agent's objective is to reach the goal tile along a walkable path, ideally the shortest one. The agent receives a reward each time it reaches the goal.
The following are the key components to watch out for in the gridworld.
S 🡪 Starting position (Safe)
F 🡪 Frozen surface (Safe for some time)
H 🡪 Hole (Death)
G 🡪 Goal (Safe and ultimate goal).
The agent can perform the following actions in the Frozen Lake environment:
- Left – 0
- Down – 1
- Right – 2
- Up – 3
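The tile types and action codes listed above can be written down directly in Python. The sketch below assumes the standard 4x4 Frozen Lake layout, which the text does not spell out, and the names `Action`, `LAKE`, and `tile_at` are hypothetical helpers introduced only for illustration.

```python
from enum import IntEnum


class Action(IntEnum):
    """Action codes as listed in the text."""
    LEFT = 0
    DOWN = 1
    RIGHT = 2
    UP = 3


# Assumed layout: the standard 4x4 Frozen Lake map, one string per row.
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
TILE_MEANING = {"S": "start", "F": "frozen surface", "H": "hole", "G": "goal"}


def tile_at(state):
    """Map a state index (0 to 15, row by row) to its tile letter."""
    row, col = divmod(state, len(LAKE[0]))
    return LAKE[row][col]


holes = [s for s in range(16) if tile_at(s) == "H"]
print(holes)  # [5, 7, 11, 12]
print(TILE_MEANING[tile_at(15)])  # goal
```

States are numbered row by row, so state 0 is the start tile in the top-left corner and state 15 is the goal in the bottom-right corner.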
We will implement dynamic programming with PyTorch for the Frozen Lake environment, since dynamic programming is well suited to gridworld-like environments. The implementation covers policy evaluation, policy improvement, policy iteration, and value iteration.
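Value iteration, the last method in that list, can be sketched as follows. Note that this sketch uses plain Python lists rather than PyTorch tensors, so that it runs without extra dependencies. It also assumes the standard 4x4 layout, deterministic moves, and a discount factor of 0.9, none of which come from the text. The function names are hypothetical.

```python
# Assumptions (not from the text): standard 4x4 map, no slipping, GAMMA = 0.9.
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # left, down, right, up
GAMMA = 0.9


def backup(values, state, action):
    """Return r + gamma * v(s') for one deterministic move."""
    row, col = divmod(state, 4)
    if LAKE[row][col] in "HG":  # holes and the goal are absorbing
        return GAMMA * values[state]
    row = min(max(row + MOVES[action][0], 0), 3)
    col = min(max(col + MOVES[action][1], 0), 3)
    reward = float(LAKE[row][col] == "G")
    return reward + GAMMA * values[row * 4 + col]


def value_iteration(tol=1e-8):
    """Sweep the Bellman optimality backup until the values stop changing."""
    values = [0.0] * 16
    while True:
        delta = 0.0
        for s in range(16):
            best = max(backup(values, s, a) for a in range(4))
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # Read off a greedy policy from the converged values.
    policy = [max(range(4), key=lambda a: backup(values, s, a))
              for s in range(16)]
    return values, policy


values, policy = value_iteration()
for row in range(4):
    print(" ".join(f"{values[row * 4 + col]:.3f}" for col in range(4)))
```

Unlike policy iteration, value iteration never evaluates an intermediate policy in full; it applies the maximizing backup directly and extracts a policy only at the end.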
Import the gym library, an open-source toolkit created by OpenAI for running reinforcement learning experiments. In the following step, we register the parameters for Frozen Lake, make the Frozen Lake game environment, and print the observation space of the environment.
Assign the observation space to a variable and print it to see the number of states available in the environment.
We sample grid cells from the observation space for a range of g. The environment has 16 possible cells, numbered 0 to 15.
Then, we print the action space, the set of moves the agent can use to find the shortest walkable path under an optimal policy.
We can list the actions available to the agent in the Frozen Lake environment by sampling from the action space 15 times.
We then render the environment to explore the current state of the environment.
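The inspection steps above might look roughly like the sketch below. It rests on several assumptions: it uses the built-in `FrozenLake-v1` environment ID with `is_slippery=False` instead of a custom registration, it tries the maintained `gymnasium` package before the legacy `gym` package, and it leaves out rendering because the render call differs between library versions. The helper names `load_gym` and `inspect_frozen_lake` are hypothetical.

```python
def load_gym():
    """Return gymnasium if installed, else legacy gym, else None."""
    try:
        import gymnasium as gym
    except ImportError:
        try:
            import gym
        except ImportError:
            return None
    return gym


def inspect_frozen_lake(n_samples=15):
    """Print the spaces of a 4x4 Frozen Lake and a few random samples."""
    gym = load_gym()
    if gym is None:
        print("Neither gymnasium nor gym is installed.")
        return None

    # Assumption: the built-in environment ID, with slipping turned off.
    env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)

    print(env.observation_space)  # discrete space of grid cells
    print(env.action_space)       # discrete space of moves

    # Random samples drawn from each space.
    states = [int(env.observation_space.sample()) for _ in range(n_samples)]
    actions = [int(env.action_space.sample()) for _ in range(n_samples)]
    print(states)
    print(actions)

    summary = (int(env.observation_space.n), int(env.action_space.n))
    env.close()
    return summary


summary = inspect_frozen_lake()
```

When one of the libraries is available, `summary` holds the number of states and the number of actions.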
We can navigate the Frozen Lake gridworld by going left with action 0. This does not incur a penalty, as there is nothing on the left side. We can move down with action 1, and moving right with action 2 is not a problem either, because the agent is still standing on the frozen surface. Moving right twice with action 2 is also safe, since the agent again steps on the frozen surface. However, going down thrice is fatal: the agent falls through a hole into the frozen lake. The agent is unlikely to survive and swim back out of the hole, so for the purposes of the Frozen Lake game we consider that the agent dies.
To navigate the gridworld of the Frozen Lake environment successfully, the agent has to move right twice, down thrice, and right once more to reach the goal.
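These routes can be checked with a few lines of plain Python. The sketch assumes the standard 4x4 layout and deterministic moves (no slipping), and it starts every route from the start tile. The `walk` helper is hypothetical and stands in for stepping a real environment.

```python
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]  # assumed standard 4x4 layout
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def walk(actions, start=(0, 0)):
    """Follow the actions from the start tile; stop at a hole or the goal.

    Moves are deterministic (an assumption), and moving into the border
    leaves the agent where it is. Returns the letter of the final tile.
    """
    row, col = start
    for action in actions:
        d_row, d_col = MOVES[action]
        row = min(max(row + d_row, 0), len(LAKE) - 1)
        col = min(max(col + d_col, 0), len(LAKE[0]) - 1)
        if LAKE[row][col] in "HG":
            break
    return LAKE[row][col]


print(walk([LEFT]))                                   # S
print(walk([RIGHT, RIGHT]))                           # F
print(walk([DOWN, DOWN, DOWN]))                       # H
print(walk([RIGHT, RIGHT, DOWN, DOWN, DOWN, RIGHT]))  # G
```

Going left from the start leaves the agent in place, three moves down end in a hole, and the six-move route ends on the goal tile.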
Dr Ganapathi Pulipaka is Chief AI HPC Scientist and bestselling author of books covering AI infrastructure, supercomputing, high-performance computing for HPC, parallel computing, neural network architecture, data science, machine learning, and deep learning in C, C++, Java, Python, R, TensorFlow, and PyTorch on Linux, macOS, and Windows.