Computer vision has grown rapidly over the last few years, primarily due to deep learning, which has made it possible to detect obstacles, segment images, and extract important context from a given scene. From a biological standpoint, computer vision strives to build computational models of the human visual system. From an engineering point of view, its goal is to create autonomous systems that can perform some of the tasks the human visual system performs, and even surpass it in many cases. But computer vision is incredibly complex to achieve.
For example, two of the most fundamental difficulties in computer vision are how to extract and represent the vast amount of human experience in a computer in such a manner that retrieval is easy, and the enormous amount of computation needed to perform tasks such as face recognition or autonomous driving in real time.
How Hard Can A Vision Task Be In Areas Like Autonomous Driving?
Researchers have been working to understand the vision capabilities and the knowledge base required to bring autonomous vehicles to life. Tesla is working on Autopilot, a primarily vision-based system involving a multi-task neural network. While other technologies might help self-driving vehicles recognise and avoid obstacles, computer vision helps them read road signs and follow traffic rules for maximum safety. So, can driving be converted into purely a vision problem and solved with machine learning techniques?
Jitendra Malik, a renowned computer vision expert, says that he is optimistic about fully autonomous driving in the near future. According to him, however, there will be 0.01% of cases where quite sophisticated cognitive reasoning is called for, and those cases matter: a mistake made while driving at sixty miles per hour could potentially kill somebody.
According to Malik, most of what we do in vision is done unconsciously or subconsciously. That effortlessness gives us the sense that this must be very easy to implement on a computer. “If you go into the neuroscience of computer vision, then the complexity becomes clear. A large part of the cerebral cortex is dedicated to vision processing. The reality is way more complex than imagined,” he said recently in an interview with Lex Fridman.
Machines See Numbers Not Images
Another reason why computer vision is challenging is that machines see images as numbers that represent individual pixels, whereas humans perceive photos as objects, in a highly visual and intuitive manner. It is certainly difficult for machines to process all that data when training a computer vision model. On top of that, making machines perform complex visual tasks is even more challenging in terms of the computing and data resources required.
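As a rough illustration of what "seeing numbers" means, the sketch below stores a tiny grayscale image as a grid of brightness values and answers a simple question about it by arithmetic alone. The pixel values and the helper names `bright_pixel_count` and `mean_brightness` are made up for illustration and do not come from any particular library.

```python
# A tiny 5x5 grayscale "image": each number is one pixel's brightness
# (0 = black, 255 = white). The values are invented for illustration
# and form a bright plus sign on a dark background.
image = [
    [0,   0,   255, 0,   0],
    [0,   0,   255, 0,   0],
    [255, 255, 255, 255, 255],
    [0,   0,   255, 0,   0],
    [0,   0,   255, 0,   0],
]


def bright_pixel_count(img, threshold=128):
    """Count pixels whose brightness is at or above the threshold."""
    return sum(1 for row in img for value in row if value >= threshold)


def mean_brightness(img):
    """Average brightness over every pixel in the grid."""
    values = [value for row in img for value in row]
    return sum(values) / len(values)


print(bright_pixel_count(image))  # 9
print(mean_brightness(image))     # 91.8
```

A person looking at this grid sees a plus sign at once. The program only knows that 9 of the 25 numbers are large, which is why recognising objects takes so much more than reading pixels.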
According to Malik, there are subsets of the vision-based driving problem that are very solvable, such as driving in freeway conditions. But autonomous driving must work in all conditions, no matter what. That may require predictive models of the behaviour of pedestrians and other agents on the road, which can be incredibly complicated. For example, the system may need a high-level cognitive understanding of what a typical cyclist does, and act accordingly in advance. But the typical behaviour of a cyclist may differ from that of a pedestrian. This means current computer vision systems need far more data than humans do to learn those same capabilities. Humans, by comparison, are natural experts at tasks that involve complex vision, like riding a bike or driving a car.
Static Image Vs Video Computer Vision
Researchers have focused heavily on single-image processing. To see why, you have to understand the historical restrictions on computational capability. Many of the choices made in the computer vision community through the decades can be understood as choices forced upon it by the lack of computing resources. This led to an extensive focus on single images rather than video.
Today, computation is no longer the bottleneck for single images, and single-image computer vision can be done quite comfortably. But video is still understudied because video computation remains quite challenging. You can now train large neural networks on much larger video datasets than was possible in the 90s, but if you want to operate at the scale of all the content on YouTube, it is still very challenging.
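To get a feel for why video is so much heavier than a single image, here is a back-of-the-envelope sketch that counts the raw numbers in one uncompressed frame and in a short clip. The resolution (1920x1080), frame rate (30 fps), clip length (60 seconds) and function names are assumptions chosen for illustration. Real pipelines usually downsample or subsample frames, so treat this as an upper bound rather than a measurement.

```python
def raw_values_per_frame(width, height, channels=3):
    """Raw numbers in one uncompressed frame (one value per colour channel per pixel)."""
    return width * height * channels


def raw_values_in_clip(width, height, fps, seconds, channels=3):
    """Raw numbers in an uncompressed clip: one frame's worth, repeated for every frame."""
    return raw_values_per_frame(width, height, channels) * fps * seconds


# Assumed example settings: full-HD colour video at 30 frames per second.
frame = raw_values_per_frame(1920, 1080)
clip = raw_values_in_clip(1920, 1080, fps=30, seconds=60)

print(frame)          # 6220800
print(clip)           # 11197440000
print(clip // frame)  # 1800
```

Under these assumptions, one minute of video holds 1,800 times as many raw values as a single image, before any model has looked at it.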
According to Malik, long-range video understanding is one of the open problems of computer vision. He says that if you have a video clip and want to understand the behaviour in it in terms of agents, their goals and intentionality, and make predictions about what might happen, it is quite challenging. “In the short-range of computer vision, it is only about detecting whether a person is sitting or standing. This is something that we can do right now. But in terms of long-range video understanding, I don’t think we can do today, as it blends into cognition, and that’s the reason why it’s challenging,” Malik told Lex Fridman.