To achieve good performance, deep neural network-based semantic segmentation typically requires large-scale, costly pixel-wise annotations for training. Some researchers have recently attempted to avoid these dense annotations by using object-level labels (e.g. bounding boxes) or image-level labels (e.g. image categories) instead. In this article, we discuss how images can be segmented using the image-level supervision approach. Below are the major points to be discussed in this article.
Table of contents
- Semantic segmentation
- What is instance segmentation?
- Types of supervision for segmentation
- Working methods
Let’s start the discussion by understanding semantic segmentation.
Semantic image segmentation is the problem of assigning an image’s pixels to a predefined set of labels based on the semantic structure to which the pixel belongs. For computing the probability distribution over the classes for each pixel, most successful models for semantic image segmentation generally use a variation of CNN.
During inference, these distributions are fed as unary potentials to fully connected conditional random fields (CRF) with Gaussian edge potentials. The CRF is used to infer joint labelling for the image’s pixels. Conditional random fields (CRFs) are the statistical modelling tool used for structured prediction in pattern recognition and image processing.
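As a minimal illustration of the first step in this pipeline, the sketch below converts per-pixel CNN logits into the unary potentials fed to the CRF: the unary potential for a class at a pixel is simply the negative log of the CNN's softmax probability. The shapes and the `unary_potentials` name are assumptions for illustration, not a particular library's API.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the class axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def unary_potentials(logits):
    """Turn per-pixel CNN logits of shape (H, W, C) into CRF unary potentials.

    The unary potential of class c at a pixel is the negative log of the
    CNN's predicted probability for c at that pixel, so confident
    predictions incur a low cost and uncertain ones a high cost.
    """
    probs = softmax(logits, axis=-1)
    return -np.log(probs + 1e-12)  # epsilon guards against log(0)

# Toy example: a 2x2 image with 3 classes; pixel (0, 0) strongly
# prefers class 1, every other pixel is uniform.
logits = np.zeros((2, 2, 3))
logits[0, 0, 1] = 5.0
u = unary_potentials(logits)
```

In a full pipeline these unaries would be combined with Gaussian pairwise potentials inside the dense CRF; that step is omitted here.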
Successful semantic image segmentation necessitates access to a large number of densely labelled images. Dense labelling of images, on the other hand, is an expensive and time-consuming process. As a result, the densely labelled images available are typically a negligible proportion of the total set of images, and models that rely solely on them have a limited scope. In what follows, these models will be referred to as fully supervised models.
Due to the limitations of fully supervised models, models that can incorporate weakly labelled images for training have been developed. These include models that use a bounding box prior, a small number of points per class and image-level labels. Models that rely solely on image-level labels are of particular interest, as the web provides an almost limitless supply of poorly annotated images.
In the following sections, we’ll look at some recently proposed models that learn to generate segmentation masks from image-level labels alone, without the help of localization cues or saliency masks. Before that, we’ll go over instance segmentation and the different types of supervision for segmentation, as they’re both relevant.
What is instance segmentation?
One of the most difficult tasks in computer vision is instance segmentation. However, obtaining the required per-pixel labels required by most instance segmentation methods is time-consuming and expensive. Current approaches to overcoming this issue rely on weaker labels (such as image-level labels) and pseudo labels obtained through object proposal methods.
While the majority of these weakly supervised methods target object detection and semantic segmentation, instance segmentation requires categorizing each object pixel and distinguishing between object instances. Most recent methods rely on deep networks and work in two steps, first detecting objects and then segmenting them. Mask-RCNN, for example, employs Faster-RCNN for detection and an FCN network for segmentation.
Types of supervision for segmentation
Because obtaining per-pixel labels is time-consuming, many weakly supervised methods have emerged that can use labels that are much cheaper to obtain, such as bounding boxes, scribbles, points, and image-level annotations. In this weakly supervised setting, the dataset consists of images paired with annotations that are relatively easy to collect, such as tags/labels of the objects present in the image.
Image-level labels as weak supervision
Because of its low cost, acquiring image-level labels is an appealing form of annotation. The annotator only needs to say whether or not a particular object class appears in an image, not how many of them there are. While this type of annotation is gaining popularity in academia, the majority of the proposed methods are for semantic segmentation.
Only recently have a few works for this problem setup surfaced. Using the class activation map (CAM), these methods identify not only a heatmap that roughly indicates the regions where objects are located but also peaks on that heatmap that represent the locations of individual objects.
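To make the CAM idea concrete, here is a minimal NumPy sketch. A CAM for a class is the sum of the last convolutional feature maps weighted by that class's weights in the final linear classifier. The function name and shapes are illustrative assumptions.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Compute a class activation map (CAM) for one class.

    features:   (C, H, W) feature maps from the last conv layer
    fc_weights: (num_classes, C) weights of the final linear classifier
    class_idx:  index of the class to visualise

    The CAM is the feature maps weighted by the classifier weights of
    the chosen class, then min-max normalised to [0, 1].
    """
    w = fc_weights[class_idx]                          # (C,)
    cam = np.tensordot(w, features, axes=([0], [0]))   # (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam

# Toy example: 2 feature channels over a 4x4 grid; channel 0 fires
# at location (1, 2), and class 1 attends only to channel 0.
features = np.zeros((2, 4, 4))
features[0, 1, 2] = 1.0
fc_weights = np.zeros((3, 2))
fc_weights[1, 0] = 1.0
cam = class_activation_map(features, fc_weights, 1)
```

The resulting heatmap peaks exactly where the attended feature channel fires, which is the signal the weakly supervised methods below exploit.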
In this section, we’ll briefly describe two image segmentation models based on image-level supervision.
Segmentation by pseudo labels
This method, proposed by Issam H. Laradji et al., can be trained effectively with image-level labels, which are much less expensive to obtain than per-pixel masks.
Fundamentally, the Weakly-supervised Instance SEgmentation method (WISE) builds on Peak Response Maps (PRM) by training a fully supervised method, Mask R-CNN, on its output pseudo masks. This procedure is effective because Mask R-CNN is potentially robust to noisy pseudo masks, and the noisy labels within these masks, being potentially uncorrelated, may be ignored during training.
Below is the architecture of this method when it is being trained.
The first component (shown in blue above) learns to classify the images in the dataset. The classifier generates a class activation map (CAM) first and then uses a peak stimulation layer (PSL) to obtain the CAM’s local maxima. The classification loss is computed using the average of these local maxima to train the classifier.
Because the CAM peaks represent located objects, it chooses a proposal for each of these objects in order to generate pseudo masks. The second component (shown in green) uses these pseudo masks to train a Mask R-CNN.
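The peak-finding step can be sketched in a few lines of NumPy. A pixel counts as a peak if it exceeds a confidence threshold and is no smaller than any of its 8-connected neighbours; this mimics the role of the peak stimulation layer, which keeps only the CAM's local maxima. The threshold value and the `find_peaks` name are assumptions for illustration.

```python
import numpy as np

def find_peaks(cam, threshold=0.5):
    """Return (row, col) coordinates of local maxima in a CAM.

    A pixel is a peak if its activation exceeds `threshold` and is
    greater than or equal to all of its 8-connected neighbours.
    """
    H, W = cam.shape
    # Pad with -inf so border pixels can still qualify as peaks.
    padded = np.pad(cam, 1, mode="constant", constant_values=-np.inf)
    peaks = []
    for i in range(H):
        for j in range(W):
            window = padded[i:i + 3, j:j + 3]  # 3x3 neighbourhood
            if cam[i, j] >= threshold and cam[i, j] >= window.max():
                peaks.append((i, j))
    return peaks

# Toy CAM with two distinct activation peaks.
cam = np.zeros((5, 5))
cam[1, 1] = 0.9
cam[3, 4] = 0.8
peaks = find_peaks(cam)
```

In WISE, each such peak is then matched with an object proposal (e.g. the best-overlapping segment from a proposal method) to form one pseudo mask per object; that matching step is omitted here.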
To summarize, this approach to instance segmentation with image-level supervision consists of two major steps: (1) obtain pseudo masks for the training images based on their ground-truth image-level labels; and (2) train a fully supervised instance segmentation method on these pseudo masks (shown in the above figure).
This framework is built around two components: a network that generates pseudo masks by training a PRM on image-level labels and leveraging object proposal methods, and Mask R-CNN, a fully supervised instance segmentation method trained on those pseudo masks.
Segmentation by Pixel label estimator
This model, proposed by Gaurav Pandey et al., learns to generate segmentation masks from image-level labels alone, without the use of localization cues or saliency masks. A pixel-label loss and a neighbourhood loss are applied to the output of a CNN. Because true pixel labels are unavailable, the CNN output is mapped to auxiliary pixel labels to obtain an approximate segmentation mask.
The neighbourhood loss enforces the constraints imposed by the conditional random field on the CNN output, forcing it to generate crisp segmentation masks that align with the object’s boundary.
Below is the architecture of this model.
As shown above, a fully convolutional network is used to generate a distribution over segmentation masks p(z|x) from the input image. To generate qaux(z|x), the pixel-label estimator incorporates image-label information into this distribution.
It forces the segmentation network’s output to be close to this updated distribution. At the same time, the neighbourhood loss forces the segmentation network’s output to be close to the distribution computed from its neighbours.
The procedure can be elaborated upon. A segmentation network is fed an image, and the output is a distribution over the labels for each pixel location, p(z|x). This distribution is known as the predicted distribution because it is the only one required during inference. To ensure that the predicted distribution forms a valid segmentation mask for the input image, a number of losses are imposed on it. In particular, the pixel-label estimator incorporates image-label information into the predicted distribution to generate a distribution over pixel-level labels, qaux.
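One simple way to see how image-level labels can constrain the per-pixel distribution is to zero out the probability of classes known to be absent from the image and renormalise. The paper's estimator is more involved; this NumPy sketch only illustrates that masking-and-renormalising idea, and the function name is an assumption.

```python
import numpy as np

def auxiliary_distribution(probs, image_labels):
    """Fold image-level labels into a per-pixel class distribution.

    probs:        (H, W, C) per-pixel class probabilities p(z|x)
    image_labels: iterable of class indices present in the image
                  (with background typically always included)

    Classes known to be absent from the image get zero probability;
    the remaining mass is renormalised per pixel, yielding a rough
    stand-in for the auxiliary distribution q_aux(z|x).
    """
    mask = np.zeros(probs.shape[-1])
    mask[list(image_labels)] = 1.0
    q = probs * mask
    q /= q.sum(axis=-1, keepdims=True) + 1e-12
    return q

# Toy example: one pixel, three classes, uniform prediction; the
# image-level labels say only classes 0 and 2 are present.
probs = np.full((1, 1, 3), 1.0 / 3.0)
q = auxiliary_distribution(probs, [0, 2])
```

After masking, the absent class receives zero probability and the present classes share the remaining mass equally.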
Because the true pixel-level labels are not available, this distribution can be thought of as an auxiliary ground truth. The auxiliary ground truth is used to train the segmentation network. Next, the neighbourhood estimator computes a smooth version of the output distribution by averaging the output of the neighbours for each location.
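The neighbour-averaging step described above can be sketched directly: for each pixel, the smoothed distribution is the mean of the predicted distributions of its in-bounds 4-connected neighbours. The 4-connectivity choice and the function name are illustrative assumptions.

```python
import numpy as np

def neighbourhood_average(probs):
    """Smooth a per-pixel distribution by averaging 4-connected neighbours.

    probs: (H, W, C) per-pixel class probabilities.
    Returns an array of the same shape where each pixel holds the mean
    of its in-bounds up/down/left/right neighbours' distributions.
    """
    H, W, C = probs.shape
    out = np.zeros_like(probs)
    for i in range(H):
        for j in range(W):
            neighbours = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < H and 0 <= nj < W:
                    neighbours.append(probs[ni, nj])
            out[i, j] = np.mean(neighbours, axis=0)
    return out

# Toy example: every pixel already agrees on class 0, so smoothing
# leaves the distribution unchanged.
probs = np.zeros((2, 2, 2))
probs[..., 0] = 1.0
smoothed = neighbourhood_average(probs)
```

The neighbourhood loss then penalises the gap between each pixel's prediction and this smoothed version, which is what pushes the network toward spatially coherent masks.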
Through this post, we have discussed image segmentation, covering semantic segmentation, instance segmentation, and the major types of supervision used for segmentation tasks. Lastly, we discussed two methods of image segmentation based on image-level supervision.
The first method employs a two-stage pipeline trained with image-level labels. In the first stage, it uses class activation maps with a peak stimulation layer to obtain pseudo masks. In the second stage, Mask R-CNN is trained on these pseudo masks in a fully supervised fashion. The second model also relies solely on image-level labels and performs weakly supervised semantic image segmentation.