When you think of machine learning models, two techniques come to mind immediately: supervised learning and unsupervised learning. The main difference between the two approaches is labelled data: supervised learning has it, and unsupervised learning does not.
Both approaches have their shortcomings. Over time, scientists have introduced several techniques that aim to offer the best of both. The two most popular are self-supervised learning and semi-supervised learning.
Both techniques adopt a hybrid approach. That said, both are distinct.
In supervised learning, AI systems are fed labelled data. But as we work with bigger models, it becomes difficult to label all the data. Additionally, there is simply not enough labelled data for some tasks, such as training translation systems for low-resource languages.
At the AAAI 2020 conference, Facebook’s chief AI scientist Yann LeCun presented self-supervised learning as a way to overcome these challenges. This technique obtains a supervisory signal from the data itself by leveraging its underlying structure. The general method is to predict any unobserved or hidden part of the input from the part that is observed. For example, in NLP, hidden words in a sentence are predicted using the remaining words in the sentence. Since self-supervised learning uses the structure of the data to learn, it can draw on various supervisory signals across large datasets without relying on labels.
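As a rough illustration of how hiding part of the input yields a training signal, the sketch below turns one unlabelled sentence into a masked input plus the words a model would be asked to predict. The function name `make_masked_examples`, the `[MASK]` token and the sample sentence are assumptions made for this example, not part of any particular library or model.

```python
import random

MASK = "[MASK]"  # placeholder token; chosen for this sketch only


def make_masked_examples(sentence, num_masks=1, seed=0):
    """Turn one unlabelled sentence into a (masked_input, targets) pair.

    The targets come from the sentence itself, so no human labelling is needed.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    # Choose which word positions to hide from the model.
    count = min(num_masks, len(tokens))
    positions = sorted(rng.sample(range(len(tokens)), k=count))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # the hidden word is the supervisory signal
        masked[pos] = MASK
    return masked, targets


masked, targets = make_masked_examples("the cat sat on the mat", num_masks=2)
print(masked)   # the sentence with two words replaced by [MASK]
print(targets)  # position -> original word, used as the prediction target
```

A real system would feed `masked` to a model and score its guesses against `targets`; the point here is only that the targets are produced automatically from the raw text.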
Self-supervised learning aims at creating data-efficient artificial intelligence systems. It is generally regarded as an extension of, or even an improvement over, unsupervised learning methods. However, as opposed to unsupervised learning, self-supervised learning does not focus on clustering and grouping.
It could even be seen as an autonomous form of supervised learning as it requires no human input in the form of data labelling. There are three significant advantages to self-supervised learning:
- Scalability: Supervised learning needs labelled data in order to predict outcomes for unseen data, and it may need a large dataset to build models that make accurate predictions. Manual data labelling is time-consuming and often impractical. This is where self-supervised learning helps, as it automates the process even with large amounts of data.
- Improved capabilities: Self-supervised learning has significant applications in computer vision for performing tasks such as colourisation, 3D rotation, depth completion, and context filling. Speech recognition is another area where self-supervised learning thrives.
- Human intervention: Self-supervised learning automatically generates labels without human intervention.
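To make the idea of automatically generated labels concrete, here is a minimal sketch of a rotation-style pretext task. It is a simplified 2D stand-in for the rotation task mentioned above: each image is rotated by 0, 90, 180 and 270 degrees, and the rotation index becomes the label. The helper names `rotate90` and `make_rotation_examples` and the tiny 2x2 "image" are assumptions for illustration.

```python
def rotate90(image):
    """Rotate a 2D list-of-lists image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_rotation_examples(image):
    """Return four (rotated_image, label) pairs; label k means k * 90 degrees."""
    examples = []
    current = [list(row) for row in image]  # copy so the input is not modified
    for k in range(4):
        examples.append((current, k))  # the label is derived from the transform
        current = rotate90(current)
    return examples


image = [[1, 2],
         [3, 4]]
rotation_examples = make_rotation_examples(image)
for rotated, label in rotation_examples:
    print(label, rotated)
```

A model trained to guess the label from the rotated image has to learn something about the image content, yet no person had to annotate anything.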
Despite its advantages, self-supervised learning struggles to handle uncertainty in its predictions. Where the variables are discrete, as with the words predicted by Google’s BERT model, the technique works well. However, for variables with a continuous distribution (variables obtained only by measuring), it has failed to generate successful results.
Semi-supervised learning is a combination of supervised and unsupervised learning. It uses a small amount of labelled data together with a larger share of unlabelled data. The technique typically involves the following steps:
- First, train the model on the small amount of labelled data (as in supervised learning) until it gives good results.
- Use the trained model to predict outputs for the unlabelled training data. These predictions are the pseudo labels.
- Link the labels from the labelled training data with the pseudo labels, and the inputs from the labelled training data with the inputs from the unlabelled data.
- Train the model on the combined data in the same way as one would with a fully labelled dataset.
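The steps above can be sketched in a few lines of plain Python. The example uses a very simple nearest-centroid classifier as a stand-in for the model; that choice, the helper names `fit_centroids` and `predict`, and the toy data points are all assumptions for illustration rather than a recommended setup.

```python
def fit_centroids(points, labels):
    """Fit a nearest-centroid classifier: one mean point per class."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, point):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, point))
    return min(centroids, key=lambda label: dist(centroids[label]))


# Step 1: train on the small labelled set.
labelled_x = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [4.8, 5.1]]
labelled_y = ["a", "a", "b", "b"]
model = fit_centroids(labelled_x, labelled_y)

# Step 2: predict pseudo labels for the unlabelled data.
unlabelled_x = [[0.1, 0.3], [5.2, 4.9], [0.4, 0.2], [4.7, 4.6]]
pseudo_y = [predict(model, x) for x in unlabelled_x]

# Step 3: link real labels with pseudo labels, and the two sets of inputs.
combined_x = labelled_x + unlabelled_x
combined_y = labelled_y + pseudo_y

# Step 4: retrain on the combined data as if it were fully labelled.
model = fit_centroids(combined_x, combined_y)
print(pseudo_y)
```

In practice, pseudo labels are often kept only when the model is confident about them, since wrong pseudo labels are otherwise fed back into training; this sketch omits that filtering for brevity.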
One popular semi-supervised learning technique combines clustering and classification algorithms. Clustering algorithms are unsupervised learning methods that group data based on their similarities. These algorithms help in finding the most relevant samples in the dataset. Those samples can then be labelled and used to train a supervised learning model for a classification task.
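One way to picture this is the sketch below: a small k-means routine groups the unlabelled points, and the point closest to each cluster centre is picked as the sample worth labelling. The k-means variant (initial centres taken from the first k points), the helper names and the toy data are assumptions for this example; the text does not prescribe a specific clustering algorithm.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain k-means; initial centroids are the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def representatives(points, centroids):
    """For each centroid, return the index of the closest sample."""
    return [
        min(range(len(points)), key=lambda j: sq_dist(points[j], c))
        for c in centroids
    ]


points = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.1], [0.1, 0.3], [4.8, 5.1], [5.2, 4.9]]
centroids = kmeans(points, k=2)
to_label = representatives(points, centroids)
print(to_label)  # indices of the samples to hand to a human annotator
```

Only the selected samples would need manual labels; they can then serve as the labelled set for training a classifier.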
Self-supervised vs semi-supervised learning
The most significant similarity between the two techniques is that neither depends entirely on manually labelled data. However, the similarity ends there, at least in broad terms. In self-supervised learning, the model relies on the underlying structure of the data to predict outcomes, and no manually labelled data is involved. In semi-supervised learning, by contrast, we still provide a small amount of labelled data.