In computer vision, object detection is the task of locating objects of interest among all the objects present in an image. It has many applications, such as reading vehicle license plates from traffic cameras or spotting abnormalities in X-ray or sonography images of the human body to help diagnose disease.
Since CNNs are the most widely used models for image processing and computer vision tasks, several strong CNN-based models have been developed for object detection. Among them are three region-based CNN models known as R-CNN, Fast R-CNN and Faster R-CNN. In this article, we will discuss the basic features of these three models and try to understand how they differ from each other. The major points to be discussed in this article are listed below.
Table of contents
- Object Detection
- What is R-CNN?
- Tasks performed by R-CNN
- Selective search algorithm
- Warping
- Extracting features with a CNN
- Classification
- SPPNet
- Fast R-CNN
- Faster R-CNN
- Comparing R-CNN, Fast R-CNN and Faster R-CNN
As our article is based on the task of object detection, let us understand it with the help of an example. The image below shows an instance of object detection, where an object detector recognizes and labels the different objects present in the image.

Now let us discuss the popular CNN-based algorithms used for object detection. First, we will get to know three popular models: R-CNN, Fast R-CNN and Faster R-CNN. Then we will do a comparative analysis of these models.
What is R-CNN?
R-CNNs (Region-based Convolutional Neural Networks) are a family of machine learning models used in computer vision and image processing. Designed specifically for object detection, the goal of any R-CNN is to detect objects in an input image and define bounding boxes around them.
An input image given to the R-CNN model goes through a mechanism called selective search, which extracts regions of interest represented by rectangular boundaries. Depending on the scenario, there can be around 2000 regions of interest. Each region of interest is passed through a CNN to produce output features, and these features then go through an SVM (support vector machine) classifier to classify the objects present in that region.

The above image represents the procedure an R-CNN follows while detecting objects. Within an image, we extract regions of interest using a region extraction algorithm; the number of regions can reach 2000. Each region of interest is resized to fit the CNN, the CNN computes the features of the region, and SVM classifiers classify which objects are present in it.
Tasks performed by R-CNN
The following tasks are performed by R-CNN:
Selective Search
There can be various approaches to object localization in an object detection procedure. One approach, called exhaustive search, slides filters (windows) of different sizes over the image to extract objects from it. As the number of filters or window sizes grows, the computational effort of an exhaustive search grows with it.
The selective search algorithm builds on exhaustive search, but instead of using it alone, it combines it with a segmentation of the colours present in the image. More formally, selective search is a method that separates objects in an image by assigning different colours to the segmented regions.
The algorithm starts from many small regions produced by an initial over-segmentation and uses a greedy strategy to grow them: it locates neighbouring regions with similar colours and merges them together.
The similarity between the regions can be calculated by:
S(a, b) = S_texture(a, b) + S_size(a, b)
where S_texture(a, b) is the visual (texture) similarity and S_size(a, b) is the size similarity between the two regions.
Using this algorithm, the model keeps merging regions together into ever larger ones. The image below is a representation of a selective search algorithm.

In the image, we can see the progression from tiny initial regions to selected objects: the area covered by each region grows as similar regions are merged.
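The greedy merging described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the real selective search: regions are reduced to a size and a two-bin colour histogram, texture similarity is stood in for by histogram intersection, and every pair of regions is compared rather than only neighbours. The `IMAGE_SIZE` constant and the sample regions are made up for the example.

```python
# Toy sketch of selective search's greedy merging: at each step, the most
# similar pair of regions is merged until one region covers everything.

IMAGE_SIZE = 100  # hypothetical total number of pixels in the image

def s_size(a, b):
    # Size similarity: encourages small regions to merge early.
    return 1.0 - (a["size"] + b["size"]) / IMAGE_SIZE

def s_texture(a, b):
    # Histogram intersection as a stand-in for texture similarity.
    return sum(min(x, y) for x, y in zip(a["hist"], b["hist"]))

def similarity(a, b):
    # S(a, b) = S_texture(a, b) + S_size(a, b), as in the formula above.
    return s_texture(a, b) + s_size(a, b)

def merge(a, b):
    size = a["size"] + b["size"]
    # Size-weighted average of the normalised histograms.
    hist = [(x * a["size"] + y * b["size"]) / size
            for x, y in zip(a["hist"], b["hist"])]
    return {"size": size, "hist": hist}

def greedy_merge(regions):
    proposals = list(regions)  # every intermediate region becomes a proposal
    while len(regions) > 1:
        # Pick the most similar pair (all pairs are compared for simplicity;
        # the real algorithm only compares neighbouring regions).
        pairs = [(i, j) for i in range(len(regions))
                 for j in range(i + 1, len(regions))]
        i, j = max(pairs, key=lambda p: similarity(regions[p[0]], regions[p[1]]))
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged)
    return proposals

regions = [
    {"size": 10, "hist": [0.8, 0.2]},
    {"size": 15, "hist": [0.7, 0.3]},
    {"size": 40, "hist": [0.1, 0.9]},
]
proposals = greedy_merge(regions)
print(len(proposals))  # 3 initial regions + 2 merges = 5 proposals
```

Because every intermediate region of the merging hierarchy is kept as a proposal, even a handful of starting regions yields more proposals than it has regions; in the real algorithm this hierarchy is what produces the roughly 2000 candidate boxes.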
The selective search algorithm forms the basis of object localization in R-CNN. After localization, each extracted region goes through three further processes:
- Warping
- Extracting features with a CNN
- Classification
Warping
After the selection of regions, each region goes through a CNN that extracts its features. Since the input size of a CNN is fixed, we usually need to reshape each region first. In basic R-CNN, we warp each region into a 227 x 227 x 3 image.
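A minimal sketch of this warping step, using nearest-neighbour resampling on a plain nested list. Real implementations resize (and pad) actual image crops; the tiny 4 x 4 target here stands in for 227 x 227.

```python
# Nearest-neighbour "warp" of a cropped region to a fixed output size,
# standing in for the 227 x 227 resize R-CNN applies before the CNN.
def warp(region, out_h, out_w):
    in_h, in_w = len(region), len(region[0])
    return [[region[(y * in_h) // out_h][(x * in_w) // out_w]
             for x in range(out_w)]
            for y in range(out_h)]

crop = [[1, 2], [3, 4]]       # a tiny 2 x 2 "region of interest"
warped = warp(crop, 4, 4)     # every region becomes the same fixed size
print(len(warped), len(warped[0]))  # 4 4
```

Whatever the original aspect ratio of the region, the output shape is fixed, which is exactly what the CNN's fixed input size requires.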

Extracting features with a CNN
The warped input is processed by the CNN, which extracts a 4096-dimensional feature vector for the region.

Classification
The basic R-CNN uses SVM classifiers to assign the objects found in each region to their classes.
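The per-class scoring can be sketched as a toy one-vs-rest linear classifier. The weights, biases and class names below are made up for illustration; the real R-CNN trains one SVM per class on the 4096-dimensional CNN features.

```python
# Toy one-vs-rest linear scoring, standing in for the per-class SVMs that
# R-CNN runs on each region's CNN feature vector (weights are invented).
def classify(feature, weights, biases):
    scores = {cls: sum(w * f for w, f in zip(wv, feature)) + biases[cls]
              for cls, wv in weights.items()}
    return max(scores, key=scores.get)   # highest-scoring class wins

weights = {"cat": [1.0, -0.5], "dog": [-0.5, 1.0]}  # hypothetical classes
biases = {"cat": 0.0, "dog": 0.0}
print(classify([0.9, 0.1], weights, biases))  # cat
```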

The whole architecture of R-CNN can be represented as follows:

At the end of the model, a bounding-box regressor refines the rectangle that covers each detected object in the image.
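The refinement commonly used across the R-CNN family predicts four offsets per box. A hedged sketch of how such offsets are applied to a proposal, with boxes parameterised as centre-x, centre-y, width, height (the concrete numbers are made up):

```python
# The regressor predicts offsets (dx, dy, dw, dh) that move a proposal box
# closer to the ground-truth box.
import math

def apply_deltas(box, deltas):
    x, y, w, h = box                 # centre-x, centre-y, width, height
    dx, dy, dw, dh = deltas
    return (x + dx * w,              # shift the centre, scaled by box size
            y + dy * h,
            w * math.exp(dw),        # widths/heights are scaled in log-space
            h * math.exp(dh))

refined = apply_deltas((50.0, 50.0, 20.0, 10.0), (0.1, 0.0, 0.0, 0.0))
print(refined)  # (52.0, 50.0, 20.0, 10.0)
```

Scaling the centre shift by the box size and the dimensions in log-space keeps the targets well-behaved for boxes of very different sizes.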
SPPNet
Our next model, Fast R-CNN, is inspired by SPPNet (Spatial Pyramid Pooling Network), so we should briefly discuss how SPPNet works.
Basic R-CNN is very slow in training and testing because around 2000 regions must be processed, and each region goes through the CNN separately, which takes a lot of time. SPPNet, instead of converting 2000 regions into feature maps one by one, runs the CNN over the whole image once to produce a single feature map.

The above image compares the architectures of R-CNN and SPPNet. SPPNet uses max pooling over the shared feature map: for each region of interest, it keeps the strongest activations in a fixed set of spatial bins, so every region is represented by a fixed-length feature vector without warping the image itself.

We can clearly see above the effect of a max pooling layer: the most highlighted (darkest) value in each window survives, which also helps in finding dark-coloured objects in the image.
The pooled features are then passed to a fully connected network, with an SVM for classification and a linear regressor for the bounding box.
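The key idea, pooling an arbitrary-sized feature map into a fixed-length vector, can be sketched with a two-level pyramid (a 1 x 1 grid plus a 2 x 2 grid) over a plain nested list; the real SPPNet pools more levels over multi-channel feature maps.

```python
# Spatial pyramid (max) pooling: whatever the spatial size of the feature
# map, pooling over a fixed set of grids yields a fixed-length vector that
# a fully connected layer can consume.
def spp(feature_map, levels=(1, 2)):
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for n in levels:                      # an n x n grid of pooling bins
        for by in range(n):
            for bx in range(n):
                y0, y1 = by * h // n, (by + 1) * h // n
                x0, x1 = bx * w // n, (bx + 1) * w // n
                out.append(max(feature_map[y][x]
                               for y in range(y0, y1)
                               for x in range(x0, x1)))
    return out

small = [[1, 2], [3, 4]]
large = [[i * 6 + j for j in range(6)] for i in range(6)]
print(len(spp(small)), len(spp(large)))  # 5 5 -- same length either way
```

A 2 x 2 input and a 6 x 6 input both come out as a 5-element vector (1 bin from the 1 x 1 level plus 4 bins from the 2 x 2 level), which is precisely what removes the need to warp every region to a fixed image size.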

Fast R-CNN
In Fast R-CNN, instead of performing spatial pyramid pooling, we perform ROI pooling, still utilising a single shared feature map for all the regions. The ROI pooling layer uses max pooling to convert the features inside each region of interest into one fixed-size layer.
Since max pooling is still at work here, we can consider Fast R-CNN an upgrade of SPPNet: instead of generating layers in a pyramid shape, it generates only one layer.
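A minimal single-level sketch of ROI pooling on a plain nested list: crop a window of the shared feature map and max-pool it into a fixed 2 x 2 grid. Fast R-CNN typically uses a 7 x 7 grid over multi-channel feature maps; the grid size here is shrunk for readability.

```python
# ROI pooling: every ROI, whatever its size, yields the same fixed-size
# output grid, taken from a single shared feature map.
def roi_pool(feature_map, roi, out=2):
    x0, y0, x1, y1 = roi                # ROI in feature-map coordinates
    h, w = y1 - y0, x1 - x0
    pooled = []
    for by in range(out):
        row = []
        for bx in range(out):
            ys = range(y0 + by * h // out, y0 + (by + 1) * h // out)
            xs = range(x0 + bx * w // out, x0 + (bx + 1) * w // out)
            row.append(max(feature_map[y][x] for y in ys for x in xs))
        pooled.append(row)
    return pooled

fmap = [[i * 8 + j for j in range(8)] for i in range(8)]
print(roi_pool(fmap, (0, 0, 4, 4)))  # [[9, 11], [25, 27]]
```

Because the feature map is computed once and only this cheap pooling runs per region, the per-image cost no longer scales with the roughly 2000 proposals the way it does in basic R-CNN.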


The above image shows the fully connected network, which ends with a softmax head for classification and a linear regression head that further refines the bounding box. Fast R-CNN is faster than SPPNet.
Faster R-CNN
So far we have seen that SPPNet and Fast R-CNN have no built-in method for choosing regions of interest; they rely on external selective search. This is the basic difference between Fast R-CNN and Faster R-CNN: Faster R-CNN learns its region proposals. It possesses an extra CNN for generating them, which we call the Region Proposal Network (RPN). The RPN takes the feature map as input and outputs region proposals, and these proposals go to the ROI pooling layer for further processing.
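The boxes an RPN scores are generated from a fixed set of anchors at every feature-map cell. A sketch of that anchor generation follows; the stride, scales and aspect ratios use commonly cited Faster R-CNN defaults, but treat the exact numbers as an assumption of this sketch.

```python
# Anchor generation inside a Region Proposal Network: at every feature-map
# cell, boxes of several scales and aspect ratios are laid down; the RPN
# then scores each anchor as object / not-object and regresses offsets.
def generate_anchors(fm_h, fm_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for cy in range(fm_h):
        for cx in range(fm_w):
            # Map the feature-map cell back to image coordinates.
            x = cx * stride + stride // 2
            y = cy * stride + stride // 2
            for s in scales:
                for r in ratios:
                    w = s * r ** 0.5     # w / h equals the aspect ratio r
                    h = s / r ** 0.5
                    anchors.append((x, y, w, h))  # centre-x, centre-y, w, h
    return anchors

anchors = generate_anchors(3, 3)
print(len(anchors))  # 3 * 3 cells x 3 scales x 3 ratios = 81 anchors
```

With 3 scales and 3 ratios, every feature-map cell contributes 9 anchors, which is why the proposal count grows with the feature-map area rather than depending on an external selective search pass.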

Comparing R-CNN, Fast R-CNN and Faster R-CNN
Now, let us compare the important features of all these models that we have gone through.
| | R-CNN | Fast R-CNN | Faster R-CNN |
| --- | --- | --- | --- |
| Region proposal method | Selective search | Selective search | Region proposal network |
| Prediction time per image | 40-50 s | 2 s | 0.2 s |
| Computation | High computation time | High computation time | Low computation time |
| mAP on PASCAL VOC 2007 test set (%) | 58.5 | 66.9 (trained on VOC 2007 only); 70.0 (trained on VOC 2007 + 2012) | 69.9 (trained on VOC 2007 only) |
| mAP on PASCAL VOC 2012 test set (%) | 53.3 | 65.7 (trained on VOC 2012 only); 68.4 (trained on VOC 2007 + 2012) | 67.0 (trained on VOC 2012 only); 70.4 (trained on VOC 2007 + 2012); 75.9 (trained on VOC 2007 + 2012 + COCO) |
Final Words
In this article, we have seen the different models of the R-CNN family and how they differ from each other. We have seen how different pooling methods and region proposal methods change the architecture and make the process faster. Object detection is a fascinating field, and in modern object detection, newer algorithms like YOLO and RetinaNet can help your model learn even faster and more accurately.