Yann LeCun’s earliest breakthroughs came with the invention of Convolutional Neural Networks (ConvNets). He first introduced them in the 1980s when he was a postdoctoral research associate at the University of Toronto. Inspired by the earlier work of Japanese computer scientist Kunihiko Fukushima, ConvNets were modelled after the visual cortex, the region of the brain that processes sight.
Over the years, ConvNets’ popularity grew by leaps and bounds. This popularity can be attributed largely to their architecture, effectiveness, and accuracy. They have been widely adopted for a large number of industrial applications such as recommender systems and natural language processing.
That said, ConvNets have several flaws that cannot be discounted. Some of these limitations are fundamental, pushing users to prefer other models over ConvNets. One such model is the Transformer. Initially used mainly for language processing applications, its scope has expanded to computer vision and TinyML, among others.
Is it the beginning of the end for ConvNets?
ConvNets and their limitations
ConvNets learn everything end-to-end. They combine evidence and generalise across positions. ConvNets use layers of feature detectors, and each of these feature detectors is local and repeated across space. One of the key challenges in computer vision is the variability of real-world data. The human vision system can recognise objects from different angles, against different backgrounds, and even under different lighting conditions. When objects are partially obstructed, the vision system uses cues to fill in the missing information.
While ConvNets are designed well enough to cope with translations, meaning they can recognise an object wherever it appears in the image, the same cannot be said for changes in viewpoint such as rotation and scaling. ConvNets have no built-in way to handle rotation. In a speech, Geoffrey Hinton said that ConvNets could not deal with handedness detection at all. This means that a ConvNet trained on both left and right shoes would not be able to tell the two apart.
According to Hinton, one way to solve this is to use 4D or 6D maps when training AI to perform object detection. This, however, is very expensive. For now, researchers simply gather a lot of images that show the object in various positions. This, again, is not a very efficient method. Hinton said, “We’d like neural nets that generalise to new viewpoints effortlessly. If they learn to recognise something, and you make it ten times as big, and you rotate it 60 degrees, it shouldn’t cause them any problem at all. We know computer graphics is like that, and we’d like to make neural nets more like that.”
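The contrast between translation and rotation described above can be checked on a toy example. The sketch below is not a trained ConvNet: it slides a single hand-written feature detector over a tiny image, and both the kernel and the image are hypothetical values chosen for illustration. Shifting the input simply shifts the response, while rotating the input changes the response itself.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one small kernel over the image (cross-correlation, 'valid' mode)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same weights are reused at every position (weight sharing).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical detector for a vertical edge (dark on the left, bright on the right).
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])

image = np.zeros((8, 8))
image[2:5, 3] = 1.0                        # a short vertical bar

shifted = np.roll(image, shift=2, axis=1)  # same bar moved two pixels to the right
rotated = np.rot90(image)                  # same bar rotated by 90 degrees

resp = conv2d_valid(image, kernel)
resp_shifted = conv2d_valid(shifted, kernel)
resp_rotated = conv2d_valid(rotated, kernel)

# Translation: the response map is the original response, moved by the same amount.
translation_equivariant = np.allclose(np.roll(resp, 2, axis=1), resp_shifted)

# Rotation: the detector's strongest response drops, because the bar no longer
# matches the orientation the kernel was written for.
peak_original = resp.max()
peak_rotated = resp_rotated.max()
```

In practice a network would need additional detectors, learned from rotated examples, to respond to the rotated bar, which is the data-gathering workaround mentioned above.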
Another major disadvantage of ConvNets is the pooling layers. Pooling in ConvNets generalises features and helps the network recognise a feature independent of its location in the image. Pooling is especially useful in an image classification task where the user has to detect the presence of a certain object in the image but is not very concerned about its location. Pooling increases the efficiency of the network and leads to faster training. Location invariance can also improve the statistical efficiency of the network.
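A minimal sketch of this location independence, assuming non-overlapping 2x2 max pooling (one common choice; the article does not specify a pooling type) and a hypothetical 4x4 feature map:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling; assumes even height and width."""
    h, w = feature_map.shape
    # Group pixels into 2x2 blocks, then keep the largest value in each block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

a = np.zeros((4, 4))
a[0, 0] = 1.0   # the feature fires at the top-left corner

b = np.zeros((4, 4))
b[1, 1] = 1.0   # the same feature, shifted one pixel diagonally

pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)

# Both inputs pool to the same 2x2 map: the exact position inside each
# block is discarded, and the map has a quarter as many values.
same_after_pooling = np.array_equal(pooled_a, pooled_b)
```

The smaller output is where the efficiency gain comes from: every later layer has fewer values to process.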
That said, pooling layers lead to a loss of valuable information, and they ignore the larger relationship between the part and the whole. For example, in a face detector, features like a mouth, eyes, and a nose must be present at the correct locations for the image to be classified as a face. A ConvNet will classify it as a face if these features are present, whether or not they are placed at the correct locations.
To this end, Hinton and his team filed a patent on the Capsule Neural Network as a replacement for ConvNets. The researchers claimed capsule networks could replace ConvNets for traditional computer vision applications. This model could not only detect a feature but also identify its position in the image.
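The face-detector argument above can be caricatured in a few lines. This is a deliberately extreme sketch, not how a real ConvNet is built: it assumes one hypothetical binary feature map per facial part and applies global max pooling to each, so that only the presence of a part survives. Real networks retain some spatial information across layers, so the effect is weaker in practice.

```python
import numpy as np

PARTS = ("eye", "nose", "mouth")

def make_feature_maps(positions, size=6):
    """Build one binary map per part, with 1.0 at the given (row, col) positions."""
    maps = {}
    for part in PARTS:
        m = np.zeros((size, size))
        for r, c in positions[part]:
            m[r, c] = 1.0
        maps[part] = m
    return maps

def pooled_face_score(maps):
    """Global max pooling per part, then require every part to be present."""
    return min(maps[part].max() for part in PARTS)

# Parts laid out like a face: two eyes on top, nose below, mouth at the bottom.
face = make_feature_maps(
    {"eye": [(1, 1), (1, 4)], "nose": [(3, 2)], "mouth": [(4, 2)]}
)

# The same parts scattered at arbitrary positions.
scrambled = make_feature_maps(
    {"eye": [(5, 0), (4, 5)], "nose": [(0, 3)], "mouth": [(1, 1)]}
)

face_score = pooled_face_score(face)
scrambled_score = pooled_face_score(scrambled)
# Both scores are identical: once positions are pooled away, the detector
# cannot tell a face from a jumble of facial parts.
```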
Not just limited to weak generalisations
ConvNets recognise objects very differently from humans. These differences are not limited to weak generalisations. Adding even a tiny bit of noise to an image can lead a ConvNet to recognise it as something completely different.
Given the limitations of ConvNets, other models continue to soar in popularity, most prominently Transformers. After the success of large language models like GPT-2 and GPT-3, Transformers have been successfully deployed for computer vision applications. Vision Transformer, developed by Google’s team, is an image classification model that applies the Transformer architecture to patches of the image.
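A rough illustration of how a small, carefully chosen perturbation can flip a prediction. The sketch below swaps a ConvNet for a hypothetical linear scorer over 100 "pixels" (an assumption made to keep the example short), and nudges every pixel slightly in the direction that lowers the score, the idea behind gradient-sign attacks. The weights, image, and step size are all made-up values.

```python
import numpy as np

# Hypothetical linear "classifier": score > 0 means class A, otherwise class B.
w = np.linspace(-1.0, 1.0, 100)        # made-up weights, none exactly zero

# A mid-grey image (pixel values near 0.5) that the scorer assigns to class A.
x = 0.5 + 0.02 * np.sign(w)
score = float(w @ x)

# For a linear scorer the gradient of the score with respect to the input is w,
# so moving each pixel by a small step against sign(w) lowers the score fastest.
epsilon = 0.05
x_adv = x - epsilon * np.sign(w)
score_adv = float(w @ x_adv)

# No pixel moved by more than epsilon on a 0-to-1 scale.
max_change = float(np.max(np.abs(x_adv - x)))

label = "A" if score > 0 else "B"
label_adv = "A" if score_adv > 0 else "B"
```

Here no pixel changes by more than 0.05, yet the label flips from A to B, because many tiny changes add up across all the inputs.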
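The patch step can be sketched as follows. This covers only the splitting of an image into flattened patches; the image size (32x32x3) and patch size (8) are illustrative assumptions, and the function name is hypothetical rather than taken from any Vision Transformer codebase. In the full model, each patch vector is then linearly projected and combined with a position embedding before it enters the Transformer.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    rows, cols = h // patch_size, w // patch_size
    # Separate the grid position (rows, cols) from the position inside a patch.
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, read left to right, top to bottom.
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Illustrative input: a 32x32 RGB image filled with distinct values.
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)

tokens = image_to_patches(image, patch_size=8)
# tokens has one row per patch: a 4x4 grid of patches gives 16 rows,
# each holding 8 * 8 * 3 = 192 values.
```

The resulting rows play the role that word tokens play in a language model: the Transformer attends over the sequence of patches rather than sliding a local filter across pixels.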