How I Created A ML Model That Identifies Hand Gestures

Rushi Bhagat

Hand-gesture detection and recognition has been one of the hottest research topics of the last few decades, and many data scientists and researchers have successfully applied it to interpreters for the blind, augmented reality and hand-controlled robots.

By a general definition, a gesture is “a movement of a part of the body, like a hand or the head, which is intended to express an idea or a meaning”. Research on evolution suggests that manual gestures were the first step towards communication in human history. Even newborns use hand gestures to express their desires long before they start speaking. Similarly, gestures can be used to communicate with machines and to trigger actions.

Traditional gesture recognition was only possible with external hardware controllers or wired gloves that register the user’s intentions from hand and arm movements. Microsoft’s Kinect, introduced in November 2010, is one of the best-known examples of such hardware devices; it also set a Guinness World Record as the fastest-selling consumer device when it was launched. The modern approach, by contrast, relies heavily on deep learning algorithms and computer vision, and needs no dedicated hardware device.

Collectively, this whole process can be called AirGesture, as you don’t have to touch the screen or your keyboard to communicate with the machine.

The basic workflow for creating a model that predicts hand gestures is as follows:

  • Data Creation/Using Datasets:

Our data consists of images of different kinds of hand gestures, captured from a webcam or taken from an existing dataset.

The hand gesture recognition dataset used here is composed of near-infrared images acquired by the Leap Motion sensor. It contains 10 different hand gestures performed by 10 different subjects (5 men and 5 women), for a total of 40,000 images.

  • Data Pre-processing

The images in our dataset are read in grayscale and resized to a square, single-channel format (IMG_SIZE × IMG_SIZE × 1, with IMG_SIZE = 150 in the code below), then converted to NumPy arrays so that they are suitable for tensor processing during training. If you have only a small amount of training data, you can additionally use data augmentation.
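
The data augmentation mentioned above is not shown in the original code, so the following is only a minimal NumPy sketch of the idea. The function name `augment`, the shift range and the brightness range are all assumptions chosen for illustration, and the random array stands in for a real preprocessed image.

```python
import numpy as np

def augment(image, rng, max_shift=4):
    """Return a randomly flipped, shifted and rescaled copy of a (H, W, 1) image in [0, 1]."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal flip (mirrors left and right hands)
    # Shift by a few pixels; np.roll wraps pixels around the border
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(out, shift=(int(dy), int(dx)), axis=(0, 1))
    # Random brightness change, clipped back into the valid range
    scale = rng.uniform(0.8, 1.2)
    return np.clip(out * scale, 0.0, 1.0).astype(np.float32)

rng = np.random.default_rng(0)
sample = rng.random((50, 50, 1), dtype=np.float32)  # stand-in for one preprocessed image
augmented = np.stack([augment(sample, rng) for _ in range(4)])
print(augmented.shape)  # (4, 50, 50, 1)
```

Whether a horizontal flip is appropriate depends on the gestures: if mirrored versions of a gesture should count as a different class, that step should be dropped.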

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import cv2
import os
import pandas as pd
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import Dense, Flatten, Dropout
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from tqdm import tqdm
from random import shuffle
from zipfile import ZipFile
from PIL import Image

Code Snippet – 1 : Import all the dependencies/libraries required

lookup = dict()
reverselookup = dict()
count = 0
for j in os.listdir('../input/leapgestrecog/leapGestRecog/00/'):
    if not j.startswith('.'): # If running this code locally, this is to
                              # ensure you aren't reading in hidden folders
        lookup[j] = count
        reverselookup[count] = j
        count = count + 1

Code Snippet – 2: Looking up into the dataset

The lookup dictionary now maps each of the 10 gesture types in the dataset to an integer label.

x_data = []
y_data = []
IMG_SIZE = 150
datacount = 0 # total number of images loaded
for i in range(0, 10): # one folder per subject: 00 ... 09
    for j in os.listdir('../input/leapgestrecog/leapGestRecog/0' + str(i) + '/'):
        if not j.startswith('.'): # skip hidden folders
            count = 0 # number of images of this gesture for this subject
            for k in os.listdir('../input/leapgestrecog/leapGestRecog/0' + str(i) + '/' + j + '/'):
                path = '../input/leapgestrecog/leapGestRecog/0' + str(i) + '/' + j + '/' + k
                img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
                img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
                arr = np.array(img)
                x_data.append(arr) # store the image
                count = count + 1
            y_values = np.full((count, 1), lookup[j])
            y_data.append(y_values) # store one label per image
            datacount = datacount + count
x_data = np.array(x_data, dtype = 'float32')
y_data = np.concatenate(y_data, axis = 0)
y_data = y_data.reshape(datacount, 1)

Code Snippet – 3 : To manipulate the dataset and use it for further processing

Now that the dataset is loaded, the following block of code lets us peek at the kind of images it contains.

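fig, ax = plt.subplots(5, 2, figsize=(6, 12))  # assumed layout: a 5x2 grid of random samples
for i in range(5):
    for j in range (2):
        ax[i, j].imshow(x_data[np.random.randint(datacount)], cmap='gray')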

Code Snippet – 4 : To peek into the dataset

  • Dividing dataset into testing and training sets

A training set is used to build the model, while a test set is used to validate the model once it is built. Data points in the training set are excluded from the test set. Usually, a dataset is divided either into a training set and a test set, or into a training set, a validation set and a test set.

The following block of code reshapes and normalises the dataset, and divides it into training and test sets.

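x_data = x_data.reshape((datacount, IMG_SIZE, IMG_SIZE, 1))
x_data = x_data/255  # scale pixel values to [0, 1]
# Assumed split ratio: hold out 20% of the samples for testing
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=42)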

Code Snippet – 5 : Train – Test – Split Dataset 
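
The three-way division into training, validation and test sets described above can be sketched with NumPy alone. This is an illustrative sketch rather than the article's method: the helper name `three_way_split` and the 70/10/20 proportions are assumptions, and the toy arrays stand in for the real images and labels.

```python
import numpy as np

def three_way_split(x, y, val_frac=0.1, test_frac=0.2, seed=0):
    """Shuffle the sample indices once, then cut them into three disjoint parts."""
    n = len(x)
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx]), (x[test_idx], y[test_idx])

# Toy data: 100 samples whose "image" is just their own index
x = np.arange(100).reshape(100, 1)
y = np.arange(100) % 10
(train_x, train_y), (val_x, val_y), (test_x, test_y) = three_way_split(x, y)
print(len(train_x), len(val_x), len(test_x))  # 70 10 20
```

Because every index appears in exactly one part, no sample used for training can leak into the validation or test data.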

  • Training of Deep Learning Model

Start by reviewing the packages that are imported and make sure you have all the dependencies installed.

  1. Create Convolution Network ( ConvNet )

The main purpose of a convolutional layer is to extract features from the input image. It preserves the spatial relationship between pixels by learning image features from small squares of input data.

The two main hyperparameters are:

  • filters – An integer, the dimensionality of the output space (that is, the number of output filters in the convolution).
  • kernel_size – An integer or a tuple/list of 2 integers specifying the height and width of the 2-D convolution window. A single integer specifies the same value for both spatial dimensions.

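Every image can be considered a matrix of pixel values. Consider a 5×5 image whose pixel values are only 1 and 0, and a 3×3 filter matrix. The convolution slides the filter over the image and, at each position, sums the element-wise products of the filter and the image patch underneath it, which produces a 3×3 feature map.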
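
The 5×5 by 3×3 computation described above can be worked through in a few lines of NumPy. The particular pixel and filter values below are assumptions picked for illustration, and the helper `convolve2d_valid` is a hypothetical name, not a library function.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum element-wise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
feature_map = convolve2d_valid(image, kernel)
print(feature_map)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```

In a trained network the filter values are not fixed by hand like this; they are the weights the layer learns.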

  2. Apply Max Pooling to layers

Max pooling is applied because it reduces the dimensionality of each feature map while retaining the most important information.

  3. Apply Flattening to the output of the layers

During flattening, the matrix is converted into a linear array so that it can be fed into the nodes of our neural network.
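
The pooling and flattening steps can be illustrated on a small feature map. This is a sketch with made-up values; `max_pool_2x2` is a hypothetical helper that mimics a 2×2 max-pooling window with stride 2.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    # Drop an odd trailing row/column, then group the pixels into 2x2 blocks
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [3, 4, 1, 8]])
pooled = max_pool_2x2(feature_map)
flat = pooled.flatten()
print(pooled)
# [[6 4]
#  [7 9]]
print(flat)  # [6 4 7 9]
```

Pooling halves each spatial dimension here (4×4 to 2×2), and flattening turns the result into the one-dimensional vector that a dense layer expects.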

  • The following is the implementation of the convolutional neural network for the hand gesture recognition model:

model = Sequential()
model.add(Conv2D(filters = 32, kernel_size = (5,5), padding = 'same', activation = 'relu', input_shape = (IMG_SIZE,IMG_SIZE,1)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())  # flatten the feature maps before the dense layers
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))

Code Snippet – 6: Convolution Neural Network for Hand Gesture Recognition
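
To see how the filters and kernel_size settings shape the network, the output size and parameter count of each convolution and pooling layer above can be traced by hand. The sketch below uses plain Python and assumes stride 1 for convolutions, the default "valid" padding where none is given, and a 150×150×1 input; the helper names are hypothetical.

```python
def conv_out(size, kernel, padding):
    """Output height/width of a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

size, channels = 150, 1
layers = [
    ("conv", 5, 32, "same"),
    ("conv", 3, 64, "valid"),
    ("pool", 2),
    ("conv", 3, 64, "valid"),
    ("pool", 2),
]
trace = []
for layer in layers:
    if layer[0] == "conv":
        _, kernel, filters, padding = layer
        params = conv_params(kernel, channels, filters)
        size, channels = conv_out(size, kernel, padding), filters
    else:
        params = 0  # pooling has no trainable weights
        size = size // layer[1]
    trace.append((layer[0], size, channels, params))
    print(layer[0], (size, size, channels), params)

flattened = size * size * channels
print("flattened length:", flattened)  # flattened length: 82944
```

The flattened length is the number of inputs that each unit of the first dense layer would receive, which is why most of the model's weights end up in that layer.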

Next, we need to compile and fit the model.

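#Compiling model (optimizer and loss are assumptions; the original settings are not shown)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
epochs, batch_size = 10, 64  # assumed values; tune them for your hardware and dataset
#Fitting the model
History = model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1, validation_data=(x_test, y_test))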

Code Snippet – 7: Final evaluation and fitting model

After training the model, we would like to visualize the loss function as well as the accuracy of the model on training data and test data.

plt.plot(History.history['loss'])
plt.plot(History.history['val_loss'])
plt.title('Model Loss')
plt.legend(['train', 'test'])

Code Snippet – 8: Visualization of Loss function

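plt.plot(History.history['accuracy'])  # the key is 'acc' in older Keras versions
plt.plot(History.history['val_accuracy'])
plt.title('Model Accuracy')
plt.legend(['train', 'test'])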

Code Snippet – 9: Visualization of the Accuracy 

Finally, we would like to validate the gestures by comparing the predicted labels with the true labels on sample test images.

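def validate_gestures(predictions_array, true_label_array, img_array):
  # Assumption: class_names follows the same order as the integer labels used in training
  class_names = ["down", "palm", "l", "fist", "fist_moved", "thumb", "index", "ok", "palm_moved", "c"]
  plt.figure(figsize=(15, 5))
  for i in range(1, 10):
    prediction = predictions_array[i]
    true_label = int(np.ravel(true_label_array[i])[0])  # labels are stored with shape (1,)
    img = img_array[i]
    img = cv2.cvtColor(img, cv2.COLOR_GRAY2RGB)
    plt.subplot(3, 3, i)
    plt.xticks([])
    plt.yticks([])
    plt.imshow(img)
    predicted_label = np.argmax(prediction) # Get index of the predicted label from prediction
    if predicted_label == true_label:
      color = 'blue'  # correct prediction
    else:
      color = 'red'   # wrong prediction
    plt.xlabel("Predicted: {} {:2.0f}% (Actual: {})".format(class_names[predicted_label],
                                                             100 * np.max(prediction),
                                                             class_names[true_label]),
               color=color)
  plt.show()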

Code Snippet – 10: Validate Gestures 

And finally we will call the function we created to validate the gestures.

prediction = model.predict(x_test)  # class probabilities for each test image
validate_gestures(prediction, y_test, x_test)

Code Snippet – 11: Calling the validate_gestures function
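
Besides inspecting individual images, the predictions can be summarised numerically. The sketch below is an illustration on made-up probabilities for three classes, not output from the trained model; `confusion_counts` is a hypothetical helper that mirrors what a confusion matrix records.

```python
import numpy as np

def confusion_counts(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        matrix[t, p] += 1
    return matrix

# Made-up softmax outputs for 4 samples and 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.2, 0.6]])
true = np.array([[0], [1], [2], [2]])  # labels stored with shape (N, 1), as in the loader

pred = probs.argmax(axis=1)   # most probable class per sample
true_flat = true.ravel()      # flatten to (N,) so the comparison is element-wise
accuracy = (pred == true_flat).mean()
matrix = confusion_counts(true_flat, pred, num_classes=3)
print(accuracy)  # 0.75
print(matrix)
# [[1 0 0]
#  [0 1 0]
#  [0 1 1]]
```

Flattening the labels first matters: comparing an (N,) array with an (N, 1) array broadcasts to an N×N grid and gives a misleading accuracy.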


In the code above, we implemented a convolutional neural network that gives us a model trained on multiple hand gestures. Using that model, we can build projects in which hand gestures are used to communicate with machines. The photo above shows one such project, which I named AirGesture because you don’t have to touch the keyboard or screen to play the game (Battle Tank – 1990). Similarly, you can implement it for other games such as Mario, Google Dino, etc.

Copyright Analytics India Magazine Pvt Ltd
