An Introductory Guide to Meta Reinforcement Learning (Meta-RL)

Meta reinforcement learning (Meta-RL) applies meta-learning to the field of reinforcement learning. By bringing meta-learning into reinforcement learning, we can extend a model so that it performs a variety of tasks.

Most AI-based applications are designed for a specific task and are limited in how well they adapt to new tasks. If a model can retain and leverage knowledge about how it was trained for a particular task, that knowledge can help it adapt to a new learning environment. Meta reinforcement learning is an approach to exactly this. In this article, we will discuss this advancement of reinforcement learning, known as meta reinforcement learning (Meta-RL), which combines meta-learning and reinforcement learning. The major points to be covered in this article are listed below.

Table of Contents

  1. What is Meta-learning?
  2. What is Reinforcement Learning?
  3. What is Meta Reinforcement Learning?
    1. Difference Between Meta-RL and RL
    2. Training Procedure of Meta Reinforcement Learning
    3. Components of Meta-RL

What is Meta-Learning?

Metadata is data that provides information about other data; in short, it is data about data. Similarly, meta-learning is the process of learning to learn. A meta-learning model should be able to adapt to new environments that it did not encounter at training time.

This adaptation takes place as a mini learning session, mostly at test time, in which the model gets limited exposure to the configuration of the new task so that it can go on to complete that task. Roughly, a meta-learning model should have three major qualities:

  • It should be able to classify new data even when the mini learning session provides only a small amount of data.
  • It should be able to adapt to the environment.
  • It should perform well on the new task that it learned at test time.

Image source

The above image represents a model trained to classify images from two datasets in a four-shot setting; at test time, it is given a different dataset that it needs to learn.

An optimal meta-learning model is trained on a variety of datasets to perform various tasks of a similar kind. It should also be optimized for performance over the whole distribution of tasks, including potentially unseen ones.

In an ordinary learning procedure, the model is trained on a dataset that comes with its own task, and each sample in that dataset comes with its label. Meta-learning follows a similar procedure, with one major difference: each whole dataset is treated as a single data sample, and the model learns across datasets, whereas ordinary learning uses the individual samples within one dataset.

The optimal parameters of a meta-learning model can be represented as follows.

It is very similar to the objective of an ordinary machine learning procedure, except that each task is associated with a whole dataset D.
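The objective described above is commonly written in the form below. This is a sketch in standard notation, not necessarily the exact formula the article had in mind; the symbols p(D), a distribution over datasets, and L_θ, the learner's loss on a dataset, are assumptions introduced here.

```latex
\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{\mathcal{D} \sim p(\mathcal{D})} \left[ \mathcal{L}_{\theta}(\mathcal{D}) \right]
```

Read this way, the only change from the usual training objective is that the expectation runs over datasets rather than over individual samples.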

A basic meta-learning model has the following major components:

  • Learner: a basic machine learning model that is trained to perform a given task, such as classification.
  • Optimizer: updates the learner according to new information or input to the model.

Since the main focus of this article is an overview of meta reinforcement learning models and procedures, we do not discuss meta-learning in depth; this section gives only a simple explanation, and the next section does the same for reinforcement learning.

What is Reinforcement Learning?

Reinforcement learning is one of the three main learning paradigms in machine learning, alongside supervised and unsupervised learning, and it is built on a reward system. In the marketing field, for example, agents perform tasks and earn rewards accordingly. Similarly, reinforcement learning has agents, tasks, and rewards.

Let’s say the agent is in an unknown environment, and a reward is defined for its interactions with that environment. Put simply, the agent needs to take actions in the environment that maximize its cumulative reward. A basic example is a bot playing a game, where the aim of the bot is to achieve a high score.

Image source

The above image is a representation of the reinforcement learning system where an agent is interacting with the environment and taking actions to maximize the cumulative rewards.

The goal of reinforcement learning is to learn an optimal strategy for the agent to follow. This strategy is obtained through experimental trials and the feedback the agent receives on its strategy. Maximizing the cumulative reward by following the optimal strategy also helps the agent adapt to the environment.

Let’s say that a model describes an environment and an agent is acting in that environment. The agent can be in one of many states and has several actions available in each state, from which it chooses one. Which state it moves to next is governed by the transition probabilities between states of the environment. Once the agent has performed the action, it receives a reward as feedback from the environment.
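The interaction described above can be sketched as a short loop. This is a minimal illustration, not a reference implementation: the `CorridorEnv` class, its reward values, and the two example policies are hypothetical and invented for this sketch.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left; walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        # Assumed reward scheme: small step penalty, bonus on reaching the goal.
        reward = 1.0 if done else -0.1
        return self.state, reward, done


def run_episode(env, policy, max_steps=50):
    """Run one episode and return the cumulative reward collected by the agent."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # the agent picks an action
        state, reward, done = env.step(action)   # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state):
    return random.choice([0, 1])


def always_right(state):
    return 1
```

Calling `run_episode(CorridorEnv(), always_right)` and `run_episode(CorridorEnv(), random_policy)` shows how the cumulative reward depends on the strategy the agent follows.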

By the above consideration, we can say that an RL procedure has the following major components:

  • Environment
  • Model
  • State
  • Action
  • Reward

The model is a deciding component of RL: it defines the reward function and the transition probabilities. On that basis, there are two types of RL:

  • Model-based RL
  • Model-free RL

Model-based RL relies on planning with perfect information about the model, while model-free RL relies on learning with incomplete information.
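To make the contrast concrete, here is a small sketch under stated assumptions. Value iteration is used as one example of planning with a known model, and tabular Q-learning as one example of learning from sampled experience; the article does not name either algorithm. The array layout `P[s, a, s']` and `R[s, a]` and all hyperparameters are choices made for this illustration.

```python
import numpy as np


def value_iteration(P, R, gamma=0.9, iters=200):
    """Model-based: plan directly with known P[s, a, s'] and R[s, a]."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)  # expected return of each (state, action) pair
        V = Q.max(axis=1)
    return Q


def q_learning(P, R, rng, gamma=0.9, alpha=0.1, steps=20000, eps=0.2):
    """Model-free: P and R only simulate the environment; the learner sees samples."""
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(steps):
        # Epsilon-greedy exploration.
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        s_next = int(rng.choice(n_states, p=P[s, a]))
        # Move the estimate towards the sampled one-step target.
        Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q
```

Both functions return a table of action values; taking `Q.argmax(axis=1)` gives the greedy action for each state.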

We have now covered the basic ideas behind meta-learning and reinforcement learning. Meta reinforcement learning can be considered a combination of the two, and we discuss it in the next part of this article.

What is Meta Reinforcement Learning?

Meta reinforcement learning (Meta-RL) applies meta-learning to the field of reinforcement learning. Ordinary reinforcement learning models are trained and tested on the same set of problems, whereas bringing meta-learning into reinforcement learning lets us extend the model so that it performs a variety of tasks.

Let’s start with the formulation to get a proper overview of Meta-RL. Say we have a distribution of tasks, each formalized as a Markov decision process (MDP), where each MDP is determined by the tuple Mi = ⟨S, A, Pi, Ri⟩:

  • S is a set of states
  • A is a set of actions
  • Pi is the transition probability function
  • Ri is the reward function
  • Mi is the i-th task, formalized as an MDP and drawn from the task distribution M

Test tasks for meta-learning come from the same distribution M. In some research, the formulation is modified by adding a horizon T to the tuple.
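As a rough sketch of this formulation, the snippet below draws tasks that share the state set S and action set A but each have their own Pi and Ri. The sampling scheme (Dirichlet transitions, uniform rewards), the sizes, and the function names are assumptions made for illustration only.

```python
import numpy as np


def sample_mdp(rng, n_states=4, n_actions=2):
    """Draw one task M_i = <S, A, P_i, R_i>; S and A are the same for every task."""
    # P_i[s, a] is a probability distribution over next states.
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    # R_i[s, a] is the reward for taking action a in state s.
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
    return {"P": P, "R": R}


def step(mdp, state, action, rng):
    """Sample the next state from P_i and read the reward from R_i."""
    n_states = mdp["P"].shape[-1]
    next_state = int(rng.choice(n_states, p=mdp["P"][state, action]))
    return next_state, float(mdp["R"][state, action])
```

Training tasks and test tasks would both be produced by repeated calls to `sample_mdp` with the same settings, which is what it means for them to come from the same distribution.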

Image source

The above image represents Meta-RL. In the inner loop, the agent interacts with the environment and takes actions to maximize its reward. The outer loop works across new environments, adjusting the parameters of the model that determine the agent’s behaviour. At a high level, RL and Meta-RL look almost the same, but the difference becomes clear when we look more closely.

Difference Between Meta-RL and RL

As we have seen in the section above, the architectures of RL and Meta-RL are similar, but there is one major difference: in Meta-RL, the policy observes the last reward earned by the agent and the last action performed by the agent along with the current state, whereas in RL the policy observes only the current state.

  • In RL: πθ(st)→ a distribution over A
  • In meta-RL: πθ(at−1, rt−1, st)→ a distribution over A
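The two signatures above differ only in what is fed to the policy. A minimal sketch of building those inputs, assuming discrete states and actions encoded as one-hot vectors (the encoding and the function names are choices made for this illustration):

```python
import numpy as np


def one_hot(index, size):
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec


def rl_policy_input(state, n_states):
    """Plain RL: the policy sees only the current state s_t."""
    return one_hot(state, n_states)


def meta_rl_policy_input(prev_action, prev_reward, state, n_actions, n_states):
    """Meta-RL: the policy sees a_{t-1}, r_{t-1}, and s_t together."""
    return np.concatenate([
        one_hot(prev_action, n_actions),
        [prev_reward],
        one_hot(state, n_states),
    ])
```

With 2 actions and 4 states, the plain RL input has 4 entries and the Meta-RL input has 7: two for the previous action, one for the previous reward, and four for the current state.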

The intention behind including the last reward and the last action is to let the policy learn the relationship between the current state and the agent's last action and reward, and hence the dynamics of the current MDP.

With that understanding, the policy can alter its strategy accordingly, and this is where meta-learning enters reinforcement learning. Many studies in this field use an LSTM layer as memory, and because the LSTM carries a hidden state, the last state does not need to be fed in explicitly as an input.

Training Procedure of Meta Reinforcement Learning

From the above, the training procedure of a Meta-RL model can be summarized in four steps:

  1. Select a new MDP.
  2. Reset the hidden state of the model.
  3. Collect multiple trajectories and update the model weights.
  4. Repeat from step 1.
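The control flow of these four steps can be sketched as follows. Everything here is an assumption made for illustration: the tasks are two-armed bandits (MDPs with a single state), the policy is a tiny hand-written recurrent network, and the weight update is simple hill climbing swapped in for gradient descent to keep the example short.

```python
import numpy as np

N_ACTIONS, HIDDEN = 2, 8
IN_DIM = N_ACTIONS + 1  # one-hot previous action + previous reward


def init_weights(rng):
    """Random weights for a tiny recurrent policy (hypothetical architecture)."""
    return {
        "W_in": rng.normal(0.0, 0.5, (HIDDEN, IN_DIM)),
        "W_h": rng.normal(0.0, 0.5, (HIDDEN, HIDDEN)),
        "W_out": rng.normal(0.0, 0.5, (N_ACTIONS, HIDDEN)),
    }


def sample_mdp(rng):
    # Step 1: each task is a two-armed bandit with its own payout probabilities.
    return rng.uniform(0.0, 1.0, size=N_ACTIONS)


def collect_trajectories(weights, arm_probs, rng, n_traj=2, steps=10):
    h = np.zeros(HIDDEN)  # Step 2: reset the hidden state for the new MDP.
    prev_a, prev_r, rewards = None, 0.0, []
    for _ in range(n_traj):       # Step 3: multiple trajectories in the same MDP;
        for _ in range(steps):    # the hidden state carries over between them.
            x = np.zeros(IN_DIM)
            if prev_a is not None:
                x[prev_a] = 1.0
            x[-1] = prev_r
            h = np.tanh(weights["W_in"] @ x + weights["W_h"] @ h)
            logits = weights["W_out"] @ h
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            prev_a = int(rng.choice(N_ACTIONS, p=probs))
            prev_r = float(rng.random() < arm_probs[prev_a])
            rewards.append(prev_r)
    return float(np.mean(rewards))


def evaluate(weights, rng, n_tasks=20):
    """Average reward per step over freshly sampled MDPs."""
    scores = [collect_trajectories(weights, sample_mdp(rng), rng) for _ in range(n_tasks)]
    return float(np.mean(scores))


def train(n_iters=30, seed=0):
    rng = np.random.default_rng(seed)
    weights = init_weights(rng)
    best = evaluate(weights, rng)
    for _ in range(n_iters):  # Step 4: repeat.
        # Weight update: hill climbing stands in for gradient descent here.
        candidate = {k: v + rng.normal(0.0, 0.1, v.shape) for k, v in weights.items()}
        score = evaluate(candidate, rng)
        if score > best:
            weights, best = candidate, score
    return weights, best
```

Calling `weights, best = train()` runs the loop; `best` is the average reward per step of the retained weights, measured on the tasks sampled during that evaluation. A practical system would replace the hill-climbing step with a policy-gradient update through the recurrent network.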

Components of Meta-RL

Broadly, Meta-RL has three major components:

  • Model 

As in RL, the model in Meta-RL has to capture the reward function and the transition probabilities of the task. We can use an RNN for this: it maintains a hidden state, which lets it acquire knowledge from history and apply it to the current state.

  • Meta-Learning Algorithm 

The main job of the meta-learning algorithm is to update the model weights, so that the model as a whole gets better at solving a new task. Usually, this is an ordinary gradient descent update of an RNN, with the hidden state reset whenever the MDP is switched.

  • A Distribution of MDPs

As stated above, each environment the agent has to act in is formalized as an MDP. During training the agent is exposed to multiple environments and tasks, so it also has to learn how to adapt to different MDPs.

A general rule of machine learning is that a model with weak inductive bias can master a wider range of variance, but it is also less sample-efficient. In ordinary RL, the learner relies on general assumptions about new inputs to make predictions, which makes learning slow. With meta-learning in RL, we can impose a variety of inductive biases drawn from the task distribution and store them in the model's memory. Which inductive bias the agent adopts at test time depends on the algorithm. In this way, the slow learning of RL across many tasks is what allows the model to pick up a new task quickly.

Final Words 

In this article, we have seen an overview of meta-learning and reinforcement learning, and how they work when the two are combined. In essence, Meta-RL brings meta-learning into reinforcement learning so that reinforcement learning becomes more capable than before.

Yugesh Verma
Yugesh is a graduate in automobile engineering and worked as a data analyst intern. He completed several Data Science projects. He has a strong interest in Deep Learning and writing blogs on data science and machine learning.
