Unsupervised learning can be thought of as learning from huge amounts of unannotated data, while reinforcement learning can be thought of as learning from a very small amount of data in the form of reward signals. A combination of these two methods gives us unsupervised reinforcement learning, which is essentially an improvement over plain reinforcement learning. In this article, we are going to discuss unsupervised reinforcement learning in detail, along with its special features and application areas. The major points that we will discuss here are listed below.
Table of Contents
- Why Unsupervised Reinforcement Learning?
- Leveraging Unsupervised Learning for Reinforcement Learning
- Why use UL as an Auxiliary Task to Speed-up RL?
- Examples of Unsupervised Reinforcement Learning
Why Unsupervised Reinforcement Learning?
Unsupervised reinforcement learning is a combination of unsupervised learning and reinforcement learning. We can divide machine learning approaches into three subfields – supervised learning, unsupervised learning, and reinforcement learning. The basic definitions of these subfields are as follows:
- Unsupervised Learning: It is a process of learning from a huge amount of unannotated data.
- Supervised Learning: It is a process of learning from a moderate amount of annotated data.
- Reinforcement Learning: It is a process of learning from reward signals. These rewards can be given either by the environment or by humans, and they amount to only a small quantity of feedback.
The image below compares these subfields in terms of data size using the picture of a cake: the bulk of the cake represents unsupervised learning, the icing (medium in size) represents supervised learning, and the cherry on top represents reinforcement learning.

Having discussed the basics of these three forms of learning, we can also divide them on the basis of active and passive learning:
|         | With Teacher           | Without Teacher               |
| ------- | ---------------------- | ----------------------------- |
| Active  | Reinforcement learning | Intrinsic reward optimization |
| Passive | Supervised learning    | Unsupervised learning         |
From the above table, we can say that,
- If the learning is active with the teacher, it can be considered as reinforcement learning that is based on extrinsic reward optimization.
- If the learning is active without a teacher, we can call it learning based on intrinsic reward optimization.
- If the learning is passive with a teacher, it can be considered as supervised learning.
- If the learning is passive without a teacher, it can be considered as unsupervised learning.
When we talk about the basic process followed by unsupervised learning, we define objective functions such that the process can categorize the unannotated or unlabelled data. There are various annotation-related problems that unsupervised learning can help us deal with. Some of them are as follows:
- Label creation, annotation, and maintenance are challenging tasks that require a lot of time and effort.
- Many domains, such as law, medicine, and ethics, require expert knowledge for annotation.
- In reinforcement learning, specifying rewards is also difficult. There are many open questions, such as whether to use a continuous-valued or categorical reward, and sparse or dense rewards.
- Collecting human behavioural data for behaviour cloning is also challenging because there is no annotated information for such data.
The above problems can be addressed using unsupervised learning, and this is why unsupervised learning and reinforcement learning will always cross paths at some point. Most of the time, reinforcement learning is required when we try to make robots perform human tasks, and unsupervised learning is inspired by how human infants learn.
So we can now understand why unsupervised learning is needed in reinforcement learning: by using the two together, there is always the possibility of making a robot learn general skills rather than only individual tasks.
Leveraging Unsupervised Learning for Reinforcement Learning
In the above section, we have seen what unsupervised learning is and why it matters for reinforcement learning. Unsupervised learning itself can be divided in the following way.

We can see that there are two basic divisions of unsupervised learning: generative and non-generative. Generative models learn, from a large amount of data, a probability distribution over how the world behaves if we perform certain actions in the environment.
These models can be used to generate synthetic data or to plan according to the learned behaviour. Using them for reinforcement learning amounts to performing reinforcement learning while already knowing how the environment will react, which helps the agent earn rewards.
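As a rough illustration of this idea, the sketch below (all class and variable names are made up for this example and do not come from any particular paper) fits a small neural network to the environment's transition behaviour from unannotated (state, action, next state) tuples and then uses the learned model to score candidate actions against a goal state.

```python
import torch
import torch.nn as nn

# Hypothetical learned dynamics model: predicts the next state from (state, action).
class DynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

state_dim, action_dim = 4, 2
model = DynamicsModel(state_dim, action_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Unannotated experience: (state, action, next_state) tuples with no rewards or labels.
states = torch.randn(256, state_dim)
actions = torch.randn(256, action_dim)
next_states = states + 0.1 * actions.sum(dim=-1, keepdim=True)  # toy stand-in dynamics

for _ in range(200):
    loss = nn.functional.mse_loss(model(states, actions), next_states)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Planning sketch: choose the candidate action whose predicted next state is closest
# to a goal state, i.e. act using knowledge of how the environment will react.
goal = torch.zeros(state_dim)
current = torch.randn(1, state_dim).repeat(16, 1)
candidate_actions = torch.randn(16, action_dim)
with torch.no_grad():
    predicted = model(current, candidate_actions)
    best_action = candidate_actions[(predicted - goal).pow(2).sum(dim=-1).argmin()]
```

A full generative world model would also capture uncertainty over outcomes, for example with a variational autoencoder, rather than predicting only the mean next state as this sketch does.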
On the other hand, when performing reinforcement learning, we can use non-generative unsupervised learning as an auxiliary task to speed up the learning done by reinforcement learning. The following chart sums up how to use UL in RL.

The above representation shows the two main ways in which UL can be used for RL. Using the generative models of unsupervised learning in reinforcement learning is a separate area of study; in this article, we only give an overview of why it is important to use non-generative UL for RL. Let’s go through some points that answer the question “Why use UL as an auxiliary task to speed up RL?”.
Why use UL as an Auxiliary Task to Speed-up RL?
The points given below answer this question.
- It is the simplest way of combining UL and RL.
- We can maximize rewards and learn about the environment using the same shared network (a sketch of this idea is given after this list).
- Representations learned with UL are very helpful for RL tasks.
- Unsupervised learning has already proven successful for label-efficient supervised learning, which suggests it can play a similar role for RL.
- Challenge: What is the right UL objective that works well with RL? How do we capture the useful aspects of very high-dimensional data?
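As a minimal sketch of the shared-network idea mentioned above (the modules, loss weights, and placeholder data below are illustrative assumptions, not a reference implementation), one encoder can feed both an RL policy head trained on rewards and an auxiliary decoder trained with an unsupervised reconstruction loss:

```python
import torch
import torch.nn as nn

obs_dim, latent_dim, n_actions = 32, 16, 4

# Shared encoder: used by both the policy head (RL) and the auxiliary decoder (UL).
encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
policy_head = nn.Linear(latent_dim, n_actions)   # trained with the RL objective
decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, obs_dim))

params = list(encoder.parameters()) + list(policy_head.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=3e-4)

# Placeholder batch; a real agent would collect this from the environment.
obs = torch.randn(128, obs_dim)
actions = torch.randint(0, n_actions, (128,))
returns = torch.randn(128)

z = encoder(obs)  # shared features

# RL term: a simple REINFORCE-style policy-gradient loss on the shared features.
log_probs = torch.log_softmax(policy_head(z), dim=-1)
rl_loss = -(log_probs[torch.arange(128), actions] * returns).mean()

# Auxiliary unsupervised term: reconstruct the observation from the same features.
aux_loss = nn.functional.mse_loss(decoder(z), obs)

# One gradient step on the combined objective.
optimizer.zero_grad()
(rl_loss + 0.1 * aux_loss).backward()
optimizer.step()
```

Because the auxiliary loss needs no labels or rewards, it can keep shaping the shared representation even when reward signals are sparse, which is where the speed-up for RL comes from.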
Generalizing the points above, we can say that with unsupervised learning we can focus label creation, annotation, and maintenance on a specific kind of data out of the whole dataset. This speeds up RL by letting it concentrate on that specific kind of data.
Examples of Unsupervised Reinforcement Learning
In this section of the article, we discuss examples where different forms of unsupervised learning have been applied to RL. Let’s take a look at the forms of unsupervised learning along with the divisions they belong to.
- Generative Unsupervised Learning
  - Autoencoder
  - Variational Autoencoder
- Non-generative Unsupervised Learning
  - Contrastive learning
  - Siamese networks
  - Data augmentations
Some of the works related to generative unsupervised learning for reinforcement learning are as follows:
- UNREAL Architecture from DeepMind: Here we can see reinforcement learning agents achieving high-level performance in various games by learning unsupervised auxiliary tasks alongside the main reward-maximizing task.
- DARLA (DisentAngled Representation Learning Agent): DARLA significantly outperforms conventional baselines in zero-shot domain adaptation scenarios where target-domain data is hard to obtain. For this, it uses a β-variational autoencoder to extract disentangled representations from high-dimensional unannotated data.
- World Models: These models can be trained quickly in an unsupervised manner using a variational autoencoder to learn a compressed spatial and temporal representation of the environment; a simplified sketch of such a VAE is given below.
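Below is a simplified, made-up sketch of the kind of variational autoencoder these generative approaches rely on: it compresses each observation into a small latent code that a downstream controller or RL agent can use in place of the raw input. The architecture and sizes are illustrative only.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Simplified VAE that compresses a flat observation into a small latent code."""

    def __init__(self, obs_dim=64, latent_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to a standard normal prior.
    recon_loss = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl

vae = TinyVAE()
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-3)
obs = torch.randn(64, 64)  # a batch of unannotated observations

for _ in range(100):
    recon, mu, logvar = vae(obs)
    loss = vae_loss(recon, obs, mu, logvar)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The latent mean can now serve as a compact state representation for an RL policy.
with torch.no_grad():
    latent_state = vae.mu(vae.enc(obs))
```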
Let’s see some of the works that apply non-generative UL to RL.
- CURL (Contrastive Unsupervised Representations for Reinforcement Learning): This work extracts high-level features from raw observations using contrastive learning and performs reinforcement learning on top of the extracted features (a simplified sketch of the contrastive objective is given after this list).
- DADS (Unsupervised Reinforcement Learning for Skill Discovery): DADS designs an intrinsic reward function that encourages the discovery of “predictable” and “diverse” skills. The intrinsic reward function is high if the changes in the environment are different for different skills (encouraging diversity) and changes in the environment for a given skill are predictable.
- Reinforcement Learning with Augmented Data (RAD): This is a simple plug-and-play module that can enhance most RL algorithms. The work performs an extensive study of general data augmentations for RL on both pixel-based and state-based inputs and introduces two new data augmentations.
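To make the non-generative side concrete, here is a simplified CURL-style contrastive objective (the encoder, the noise-based augmentation, and the temperature value are illustrative stand-ins; CURL itself uses random crops on image observations and a momentum-updated key encoder): two augmented views of the same observation are encoded and matched with an InfoNCE loss, while the other observations in the batch serve as negatives.

```python
import torch
import torch.nn as nn

obs_dim, feat_dim, batch = 32, 16, 64
encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

def augment(x):
    # Stand-in data augmentation; CURL uses random crops on image observations.
    return x + 0.05 * torch.randn_like(x)

obs = torch.randn(batch, obs_dim)

# Encode two independently augmented views of the same observations.
q = encoder(augment(obs))   # queries
k = encoder(augment(obs))   # keys (CURL computes these with a momentum encoder)

# InfoNCE: each query should match its own key against all other keys in the batch.
logits = q @ k.t() / 0.1               # similarity matrix with temperature 0.1
labels = torch.arange(batch)           # positive pair sits on the diagonal
loss = nn.functional.cross_entropy(logits, labels)

optimizer.zero_grad()
loss.backward()
optimizer.step()
# The trained encoder's features are then fed to a standard RL algorithm.
```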
Final Words
In this article, we have seen how combining unsupervised learning with reinforcement learning gives us unsupervised reinforcement learning. Through the points discussed, we understand why unsupervised learning plays a part in the process and how to best use UL for RL. Along with this, we have gone through some examples of unsupervised reinforcement learning.