
Action Recognition Using Inflated 3D CNN

Inflated 3D CNN

We have all witnessed deep learning approaches demonstrating better-than-ever results, time and again, on image representation tasks such as image captioning, semantic segmentation, and object detection. The wide variety of convolutional neural networks at our disposal lets us exploit the spatial locality of image data, i.e. the fact that nearby pixels tend to be strongly related. In this article, as you can guess from the name, let us take a deeper dive into a specific video task, namely Action Recognition.

In layman’s terms, it is the task of identifying different actions from video clips. Simple enough? A video here is treated as a sequence of 2-dimensional frames running one after another, measured in FPS (frames per second). If you think about it, this seems like a simple extension of image classification applied to multiple images, aka frames in our case; after that, all we need to do is aggregate the predictions from each frame. Yes, that’s it.
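
To make that per-frame idea concrete, here is a minimal sketch (my own illustration, not code from the model we use later) of how per-frame class probabilities could be averaged into a single video-level prediction; frame_probs is a hypothetical array of per-frame softmax outputs:

 import numpy as np

 def aggregate_frame_predictions(frame_probs):
   # frame_probs: hypothetical array of shape (num_frames, num_classes)
   # holding per-frame class probabilities; average over the time axis.
   return frame_probs.mean(axis=0)

 # Example: 30 frames, 101 classes (as in UCF101).
 video_prediction = aggregate_frame_predictions(np.random.rand(30, 101))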

Despite the enormous success of deep learning architectures in image classification on ImageNet, progress has been slower for video classification. In this article, we are going to try the Inflated 3D ConvNet, or simply I3D, as introduced by Joao Carreira and Andrew Zisserman in their paper “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset”, published as a CVPR (Conference on Computer Vision and Pattern Recognition) conference paper in 2017.

The paper introduced this new architecture for video classification. The model achieved state-of-the-art results on the HMDB51 and UCF101 datasets; moreover, when pre-trained on the Kinetics dataset, it performed extremely well and was placed first in the CVPR 2017 Charades challenge.

In simple terms, the architecture of the inflated 3D CNN model goes something like this: the input is a video, a 3D input in the sense of 2-dimensional frames with time as the third dimension. The network follows the Inception-v1 design, with convolutional (CNN) layers of stride 2, max-pooling layers, and multiple Inception modules (parallel convolutional branches and a max-pooling branch whose outputs are concatenated). It is called “inflated” because the 2D filters and pooling kernels of the ImageNet-pretrained network are expanded into 3D by adding a temporal dimension, which lets the model reuse weights learned on images. The paper also compares this design against other video architectures such as CNN+LSTM and two-stream networks. At the end, an average-pooling layer is followed by a 1x1x1 convolutional layer that produces the predictions.
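
As a rough NumPy sketch of the inflation idea (the shapes here are illustrative, not the actual I3D weights), a 2D kernel can be repeated along a new time axis and rescaled by the temporal extent, so that a video of identical frames produces the same activation as the single image would:

 import numpy as np

 # Hypothetical 7x7 2D kernel with 3 input and 64 output channels.
 kernel_2d = np.random.rand(7, 7, 3, 64)
 T = 7  # temporal extent chosen for the inflated 3D filter

 # Repeat the 2D weights T times along a new time axis and divide by T.
 kernel_3d = np.repeat(kernel_2d[np.newaxis, ...], T, axis=0) / T
 print(kernel_3d.shape)  # (7, 7, 7, 3, 64): time x height x width x in x out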

Let’s try out this model with some code!!

Code Implementation of Inflated 3D CNN

Setup and Importing Dependencies

imageio provides an easy interface for reading and writing a wide range of image data.

OpenCV provides Python bindings for computer vision problems.

 !pip install -q imageio
 !pip install -q git+https://github.com/tensorflow/docs
 !pip install -q opencv-python 

Importing what we need for displaying images in the notebook cell itself.

Verbosity describes how much information is shown in the output; here we limit the absl logging output to errors only.

 import imageio
 from IPython import display
 from absl import logging
 logging.set_verbosity(logging.ERROR)
 import os
 import random
 import re

Importing hub for loading the model; tempfile creates temporary files and directories; the ssl module provides access to Transport Layer Security, which we use to create an unverified context for downloading the videos.

 import tensorflow as tf
 import tensorflow_hub as hub
 from tensorflow_docs.vis import embed
 import tempfile
 import ssl
 import numpy as np 

The absl library provides building blocks for Python applications; we use its logging module. OpenCV (cv2) is imported here for reading and processing video frames.

 from IPython import display
 from absl import logging
 import imageio
 import cv2 

Make sure you are running Python 3; otherwise, the line below won’t work.

from urllib import request 

We are going to use the UCF101 dataset, fetched from the URL below.

 UCF_URL = "https://www.crcv.ucf.edu/THUMOS14/UCF101/UCF101/"
 VIDEO_LIST = None 

The temporary cache directory mentioned earlier is created below, along with an unverified SSL context used when downloading from the UCF server.

 CACHE_DIR = tempfile.mkdtemp()
 unverified = ssl._create_unverified_context() 
Helper Functions 

Listing all the videos available in the UCF101 dataset.

 def list_ucf_videos():
   global VIDEO_LIST
   if not VIDEO_LIST:
     index = request.urlopen(UCF_URL, context=unverified).read().decode("utf-8")
     videos = re.findall(r"(v_[\w_]+\.avi)", index)
     VIDEO_LIST = sorted(set(videos))
   return list(VIDEO_LIST) 

Fetching a video and caching it in the local file system.

 def fetch(video):
   '''
   condition for existing video
   '''
   path = os.path.join(CACHE_DIR, video)
   '''
   for a new video, build the download
   URL and fetch the data
   '''
   if not os.path.exists(path):
     url_path = request.urljoin(UCF_URL, video)
     print("Fetching %s => %s" % (url_path, path))
     data = request.urlopen(url_path, context=unverified).read()
     '''
     writing all of this into the file 
     '''
     open(path, "wb").write(data)
   return path 

Cropping each frame around its centre to get a square region (the video files themselves are opened with OpenCV in the next function).

 def crop_center(frame):
   '''
   frame shape 
   '''
   y, x = frame.shape[0:2]
   min_dimension = min(y, x)
   '''
   setting start points for 
   both the dimensions
   '''
   starting_x = (x // 2) - (min_dimension // 2)
   starting_y = (y // 2) - (min_dimension // 2)
   '''
   returning limits to dimensions
   '''
   return frame[starting_y:starting_y+min_dimension,starting_x:starting_x+min_dimension]
Loading the video with OpenCV and preprocessing each frame by calling the functions above.
 def load(path, max_frames=0, resize=(224, 224)):
   '''
   open the video file with OpenCV
   '''
   cap = cv2.VideoCapture(path)
   frames = []
   try:
     while True:
       ret, frame = cap.read()
       if not ret:
         break
       '''
       crop to a centred square, resize to 224x224,
       and convert from BGR to RGB
       '''
       frame = crop_center(frame)
       frame = cv2.resize(frame, resize)
       frame = frame[:, :, [2, 1, 0]]
       frames.append(frame)
       if len(frames) == max_frames:
         break
   finally:
     cap.release()
   '''
   dividing by 255 to get values 
   b/w 0-1
   '''  
   return np.array(frames) / 255.0
Converting the array of frames into a GIF so that we can preview the clip.
 def gif(images):
   '''
   clip pixel values and convert to uint8 for the GIF
   '''
   converted = np.clip(images * 255, 0, 255).astype(np.uint8)
   '''
   save the GIF at 25 frames per second
   '''
   imageio.mimsave('./animation.gif', converted, fps=25)
   return embed.embed_file('./animation.gif') 

NOTE: the following snippet has been taken from the kinetics-i3d GitHub repository.

 # Get the kinetics-400 action labels from the GitHub repository.
 KINETICS_URL = "https://raw.githubusercontent.com/deepmind/kinetics-i3d/master/data/label_map.txt"
 with request.urlopen(KINETICS_URL) as obj:
   labels = [line.decode("utf-8").strip() for line in obj.readlines()]
 print("Found %d labels." % len(labels)) 
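
As a quick optional sanity check (the exact names depend on the downloaded label_map.txt), you can peek at a few of the parsed labels:

 # Print the first few Kinetics-400 action labels that were parsed.
 print(labels[:5])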

Using the dataset

 videos = list_ucf_videos()
 '''
 empty dict. for storing
 '''
 categories = {}
 '''
 running loop on videos 
 from above function called
 '''
 for video in videos:
   cat = video[2:-12]
   '''
   if not present add the video
   '''
   if cat not in categories:
     categories[cat] = []
   categories[cat].append(video)
   '''
   string formatting for showing output
   easily
   '''
 print("Found %d videos in %d categories." % (len(videos), len(categories)))
 for cat, seq in categories.items():
   '''
   join the first two video names in the
   category with a comma separator
   '''
   summary = ", ".join(seq[:2])
   print("%-20s %4d videos (%s, ...)" % (category, len(seq), summary)) 
Loading and Running the Model
 '''
 instantiate the model from hub using a variable name
 '''
 model = hub.load("https://tfhub.dev/deepmind/i3d-kinetics-400/1").signatures['default'] 
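
The signature expects a batch of RGB videos with 224x224 frames and pixel values in [0, 1], which is exactly what our load function produces. Optionally, you can inspect the signature's expected input (the exact printout varies with the TensorFlow version):

 # Optional: show the input specification of the default signature.
 print(model.structured_input_signature)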

Let’s have a look at a sample video.

 '''
 getting a sample video
 '''
 path = fetch("v_CricketShot_g04_c02.avi")
 sample_video = load(path) 

Running the model: feeding it the input and printing the top predictions.

 def pred(sample_video):
   '''
   Model input 
   '''  
   model_input = tf.constant(sample_video, dtype=tf.float32)[tf.newaxis, ...]
   '''
   saving logits and probabilities of each 
   prediction
   '''
   logits = model(model_input)['default'][0]
   prob = tf.nn.softmax(logits)
   print("Printing Top 5 actions:")
   for i in np.argsort(prob)[::-1][:5]:
     print(f"  {labels[i]:22}: {prob[i] * 100:5.2f}%")
 '''
 calling the function for 
 prediction
 '''
 pred(sample_video) 

The output lists the top five predicted actions for the sample clip; the cricket-related class should appear at the top with a very high probability.

EndNote

As we can see, the model predicted the correct class with an outstanding probability. I highly recommend trying the inflated 3D CNN model on the other datasets mentioned in the article’s introduction and on different videos. Try altering the number of frames or the length of the GIF and see whether it changes the probabilities. You can also try newer models like MovieNet and TinyVideo.
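
For example, one quick experiment (a sketch built on the helper functions defined above; the choice of 25 frames is arbitrary) is to reload the same clip with fewer frames and compare the resulting probabilities:

 # Reload the same clip, keeping only the first 25 frames,
 # preview it as a GIF and re-run the prediction.
 short_video = load(path, max_frames=25)
 gif(short_video)
 pred(short_video)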

References:

Joao Carreira and Andrew Zisserman, “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset”, CVPR 2017.

I3D Kinetics-400 model on TensorFlow Hub: https://tfhub.dev/deepmind/i3d-kinetics-400/1

Mudit Rustagi

Mudit is experienced in machine learning and deep learning. He is an undergraduate in Mechatronics and worked as a team lead (ML team) for several Projects. He has a strong interest in doing SOTA ML projects and writing blogs on data science and machine learning.
