Computer Vision Primer: How AI Sees An Image



Computer vision is dominated by the Convolutional Neural Network (ConvNet or CNN) because of its high accuracy. A computer reads data as numbers and applies mathematical functions to them to produce a result. Let us look at how a computer “reads” an image with the help of a ConvNet, that is, how it collects the data from the image and computes on it.




All of us have been “artists” at some point in our lives, drawing coloured figures with different shades to give them a pleasing texture. We tell apart what we see through our perception of colour. Think of the colours on a palette: we have mixed them together to get a darker, lighter or altogether different shade, depending on the ratio used.

Similarly, an AI reads colours as values ranging from 0 to 255. The image below is a reminder of how we used to remember the six major colours through this combination. It is called the RGB model. Here, red, green and blue are the primary colours, which when combined in pairs give another set of colours: magenta, cyan and yellow. We have all been introduced to the idea that white is a combination of multiple colours, and the school physics experiment with a prism tells us the same thing by showing how white light splits into different colours.

What Does A Computer See?

Every image on our computer screen is nothing but a combination of the three primary colours: red, green and blue. Many colour models are used in computer vision, and each of them represents a colour as a combination of values. We will use the simple RGB model to understand how a computer looks at an image. Let us see what the RGB model is.

The RGB model is one of the oldest colour representations used in computer vision. Each of the three colours takes a value from 0 to 255, and the higher the value, the brighter the colour. Let us break every colour into sub-categories to create a colour palette.

Traversing From 0 to 255:

In the image above, every value of red, green and blue represents a particular shade of colour. These numbers are what the AI reads and processes. When we combine two colours, say red and green, the resulting colour is yellow, written as the triple (R, G, B) = (255, 255, 0). Combining red, green and blue in pairs, as in the .gif above, gives cyan, magenta and yellow. This is how computer vision assigns a particular colour to a pixel.
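The colour combinations described above can be tried out in a few lines of Python. This is only a minimal sketch of additive mixing: the helper name `mix` is our own, and it simply adds the channels and caps each one at 255.

```python
# Additive RGB mixing: each channel is an integer from 0 to 255.
RED = (255, 0, 0)
GREEN = (0, 255, 0)
BLUE = (0, 0, 255)


def mix(*colours):
    """Add colours channel by channel, capping each channel at 255.

    `mix` is a hypothetical helper written for this illustration.
    """
    return tuple(min(255, sum(channel)) for channel in zip(*colours))


yellow = mix(RED, GREEN)        # (255, 255, 0)
cyan = mix(GREEN, BLUE)         # (0, 255, 255)
magenta = mix(RED, BLUE)        # (255, 0, 255)
white = mix(RED, GREEN, BLUE)   # (255, 255, 255)

print(yellow, cyan, magenta, white)
```

Combining all three colours at full brightness gives white, which matches what the prism experiment tells us in reverse.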

An image is made up of pixels placed adjacent to one another. Each coloured pixel is made up of three channels placed one behind the other. The channels add up to give a specific colour, and the pixels placed together give shape to a figure in the image. Because each of the three channels can take 256 values, the RGB model can represent over 16 million (16,777,216, to be exact) colours. Let us look at a combination of RGB values which makes up different pixels.
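To make the numbers concrete, here is a small sketch in Python. It checks the colour count and splits a tiny image into its three channels. The 2×2 image and the helper `split_channels` are made up for illustration.

```python
LEVELS_PER_CHANNEL = 256   # each channel holds a value from 0 to 255
CHANNELS = 3               # red, green, blue

# Every combination of the three channel values is a distinct colour.
total_colours = LEVELS_PER_CHANNEL ** CHANNELS
print(total_colours)       # 16777216

# A hypothetical 2x2 image: rows of pixels, each pixel an (R, G, B) triple.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 0)],
]


def split_channels(img):
    """Return three 2-D grids (red, green, blue), one per channel."""
    return tuple(
        [[pixel[c] for pixel in row] for row in img]
        for c in range(CHANNELS)
    )


red, green, blue = split_channels(image)
print(red)    # the red channel as a 2x2 grid
print(green)  # the green channel
print(blue)   # the blue channel
```

Each of the three grids has the same height and width as the image, which is what "channels placed one behind the other" means in practice.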

RGB Values Of A Pixel In An Image:

The pixelated image above is a combination of three channels placed one behind the other. Let us look at how these channels are arranged to produce it.

This is how an image is formed on a computer screen. The red, green and blue values are read by the AI and stored as a matrix. Let us read this image in Python and check its dimensions: 4×4 gives us 16 pixels. There are some differences in the values because we created the images above in Windows Paint with specific RGB values and then read them back in Python. The actual image size was 222×217, and we used nearest-neighbour interpolation to resize it to 4×4 pixels with a depth of 3 to understand the image better. The next challenge is to search for this image in a collection of images and check whether it belongs to the image from which it was taken.
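The resizing step can be sketched in plain Python. This is not the code used to produce the image above. It is a minimal illustration that assumes nearest-neighbour sampling, uses a made-up 8×8 image in place of the 222×217 original, and relies on a hypothetical helper, `resize_nearest`.

```python
def resize_nearest(img, new_h, new_w):
    """Resize a 2-D grid of pixels by nearest-neighbour sampling.

    Each output pixel copies the source pixel whose position maps onto it.
    No averaging takes place, so colours are never blended.
    """
    old_h, old_w = len(img), len(img[0])
    return [
        [img[(r * old_h) // new_h][(c * old_w) // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


# Hypothetical 8x8 RGB image standing in for the real picture.
source = [[(r * 30, c * 30, 128) for c in range(8)] for r in range(8)]

small = resize_nearest(source, 4, 4)

# Height, width and depth of the resized image.
shape = (len(small), len(small[0]), len(small[0][0]))
print(shape)                                  # (4, 4, 3)
print(sum(len(row) for row in small))         # 16 pixels in total
```

In practice an imaging library would do this for us. The sketch only shows why the result is a 4×4 grid with a depth of 3.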

Image Recognition With ConvNet:

Now the AI has read every pixel of the image, extracted the RGB values and stored them in memory. It starts searching for similar images in the database to find a match. But how does this work? What is the principle behind comparing different images? Let us look into it.

In machine learning, ConvNets are feed-forward neural networks. They are used for image classification and recognition because of their high accuracy. The architecture was pioneered by computer scientist Yann LeCun in the late 1980s and '90s, inspired by the way human vision recognises things. A ConvNet follows a hierarchical model that narrows like a funnel and ends in a fully-connected layer, where every neuron is connected to every neuron of the previous layer and the output is produced.

During the training of a ConvNet, the hidden layers are where the image gets broken down. Every image consists of n × m pixels with a certain depth (3 for RGB, 1 for grayscale). In a ConvNet, the image is broken down by two processes, filtering and pooling, and squashed into a smaller representation. Filtering, also called convolution, slides a weight matrix (the filter) over the whole image in steps of a fixed stride, and at each position takes the dot product of the weights with the pixel values underneath. This is followed by a pooling layer, which reduces the result to a lower-dimensional matrix by keeping, for example, the pixel in each region that carries the most information. Let us look at this .gif, which gives a visual understanding of the process.
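The two steps above, filtering and pooling, can be sketched in plain Python for a single grayscale channel. This is a simplified illustration, not a trained ConvNet. The 5×5 image and the 2×2 filter are made up, the function names are our own, and the filter is applied without flipping, as most deep learning libraries do.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and take the dot product at each position."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for a in range(k_h):
                for b in range(k_w):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


def max_pool(feature_map, size=2):
    """Keep only the largest value in each non-overlapping size x size block."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[i * size + a][j * size + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 5x5 grayscale image and 2x2 filter (weights chosen by hand).
image = [
    [1, 2, 3, 0, 1],
    [0, 1, 2, 3, 0],
    [1, 0, 1, 2, 3],
    [3, 1, 0, 1, 2],
    [2, 3, 1, 0, 1],
]
kernel = [
    [1, 0],
    [0, 1],
]

feature_map = convolve2d(image, kernel)   # 4x4 result
pooled = max_pool(feature_map)            # 2x2 result
print(feature_map)
print(pooled)                             # [[4, 6], [6, 4]]
```

In a real network the filter weights are learned during training rather than set by hand, and many filters are applied to each of the three colour channels.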

Once we have the convolved and pooled output, we can flatten it into an array and feed it to a fully-connected layer. This information can then be compared against the images to find the probability, or likelihood, of that portion being part of the image. With a threshold of 50%, we can start predicting what the image is. A commonly used activation function for this final step is the softmax function. We can tune the parameters and build the network with more iterations and more layers to yield better results, subject to the computational power of your system.
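The final step can be illustrated with a short sketch of softmax followed by the 50% threshold. The scores and labels below are invented for the example, and `predict` is a hypothetical helper. In a real ConvNet the scores would come from the fully-connected layer.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the highest score keeps exp() numerically stable.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores, labels, threshold=0.5):
    """Return the most likely label, or None if it is below the threshold."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return labels[best], probs[best]
    return None, probs[best]


# Made-up class labels and output scores for illustration only.
labels = ["cat", "dog", "bird"]

label, confidence = predict([2.0, 0.5, 0.1], labels)
print(label, round(confidence, 3))   # confident enough to predict "cat"

label, confidence = predict([1.0, 1.0, 1.0], labels)
print(label, round(confidence, 3))   # all classes equally likely, so None
```

When no class reaches the threshold, the sketch declines to predict rather than guess, which is one simple way to use a 50% cut-off.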


While building a ConvNet can be time-consuming, the results are fascinating and well worth the effort. It is no surprise that this method is among the most popular in AI, as its results are often significantly better than those of traditional computer vision techniques, such as those offered by libraries like OpenCV. Apart from RGB, other colour models such as HSI and CMYK also yield good results.

Kishan Maladkar
Kishan Maladkar holds a degree in Electronics and Communication Engineering and is exploring the field of Machine Learning and Artificial Intelligence. He is a data science enthusiast who loves to read about computational engineering and contribute to the technology shaping our world. He is a Data Scientist by day and a gamer by night.
