We have all witnessed how deep learning approaches have repeatedly delivered strong empirical results on image representation tasks such as image captioning, semantic segmentation, and object detection. The wide variety of convolutional neural networks available today lets us exploit spatial locality in image data, that is, the assumption that pixels close to each other are strongly related. In this article, as the title suggests, we take a deeper dive into a specific video task: Action Recognition.
In layman’s terms, action recognition means identifying the action being performed in a video clip. Simple enough? A video is treated here as a sequence of 2-dimensional frames shown one after another; the rate at which they are shown is the FPS (frames per second). At first glance this looks like a straightforward extension of image classification to multiple images, here the frames: classify each frame, then aggregate the per-frame predictions. In principle, that’s it.
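The per-frame baseline described above can be sketched in a few lines. This is only an illustration and not the method used later in the article: `frame_probs` is a hypothetical list of per-frame class probabilities that some image classifier would produce, and averaging is just one possible way to aggregate them.

```python
# Hypothetical sketch: combine per-frame predictions into one video-level label.
def aggregate_frame_predictions(frame_probs):
    """Average per-frame class probabilities and return (best_class, mean_probs)."""
    if not frame_probs:
        raise ValueError("need at least one frame")
    num_classes = len(frame_probs[0])
    mean_probs = [
        sum(frame[c] for frame in frame_probs) / len(frame_probs)
        for c in range(num_classes)
    ]
    best_class = max(range(num_classes), key=lambda c: mean_probs[c])
    return best_class, mean_probs


# Three frames and three made-up classes; the numbers are invented.
frame_probs = [
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.4, 0.1],
]
best_class, mean_probs = aggregate_frame_predictions(frame_probs)
print(best_class, [round(p, 3) for p in mean_probs])
```

Note that averaging ignores the order of the frames, so this baseline cannot tell apart two actions that differ only in the direction of motion.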
Despite the enormous success of deep learning architectures in image classification on ImageNet, progress has been slower for video classification. In this article we try out the Two-Stream Inflated 3D ConvNet, or simply I3D, introduced by Joao Carreira and Andrew Zisserman in their paper “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset”, published in 2017 as a CVPR (Conference on Computer Vision and Pattern Recognition) conference paper.
The paper introduced this new architecture for video classification. At the time, the model achieved state-of-the-art results on the HMDB51 and UCF101 datasets. Moreover, I3D models pre-trained on the Kinetics dataset placed first in the CVPR 2017 Charades challenge.
In simple terms, the architecture of the inflated 3D CNN goes like this. The input is a video, that is, a 3D input made of 2-dimensional frames with time as the third dimension. The network starts with a convolutional layer with stride 2, followed by max-pooling layers and a stack of Inception modules (parallel convolution branches plus one max-pooling branch, whose outputs are concatenated). It is called “inflated” because the 2D filters and pooling kernels of an image classification network are expanded into 3D by giving them an additional temporal dimension, so an N x N filter becomes N x N x N; the 3D filters can be initialized from ImageNet-trained 2D weights by repeating them N times along the time axis and dividing by N. The paper also compares this design with other video architectures, such as ConvNet + LSTM, plain 3D ConvNets, and two-stream networks. At the end, there is an average pooling layer followed by a 1x1x1 convolution layer that produces the predictions.
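In the paper, “inflating” a network means turning its 2D filters into 3D ones and bootstrapping the 3D weights from the 2D ones: each 2D kernel is repeated N times along a new time axis and divided by N. The dependency-free sketch below illustrates that idea on a single toy kernel. The function names and the numbers are made up for illustration; the real model applies this to every filter of a pre-trained image network.

```python
def inflate_kernel(kernel_2d, time_depth):
    """Repeat a 2D kernel `time_depth` times along time and rescale by 1/time_depth."""
    return [
        [[w / time_depth for w in row] for row in kernel_2d]
        for _ in range(time_depth)
    ]


def response_2d(kernel_2d, patch):
    """Dot product of a 2D kernel with an image patch of the same size."""
    return sum(
        w * p
        for k_row, p_row in zip(kernel_2d, patch)
        for w, p in zip(k_row, p_row)
    )


def response_3d(kernel_3d, clip):
    """Dot product of a 3D kernel with a clip (list of patches) of the same size."""
    return sum(response_2d(k, frame) for k, frame in zip(kernel_3d, clip))


# Toy 3x3 edge filter and a toy 3x3 image patch (invented values).
kernel_2d = [[1.0, 0.0, -1.0],
             [2.0, 0.0, -2.0],
             [1.0, 0.0, -1.0]]
patch = [[3.0, 1.0, 0.0],
         [4.0, 1.0, 1.0],
         [5.0, 9.0, 2.0]]

kernel_3d = inflate_kernel(kernel_2d, time_depth=3)
boring_clip = [patch] * 3  # the same frame repeated three times

print(response_2d(kernel_2d, patch))
print(response_3d(kernel_3d, boring_clip))
```

The two responses agree up to floating-point rounding: on a video made of one repeated frame, the inflated filter behaves like the original 2D filter, which is why the ImageNet weights remain a sensible starting point.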
Let’s try out this model with some code!!
Code Implementation of Inflated 3D CNN
Setup and Importing Dependencies
imageio provides an easy interface for reading and writing a wide range of image data.
opencv-python provides the Python bindings for OpenCV, which we use to read and resize video frames.
!pip install -q imageio
!pip install -q git+https://github.com/tensorflow/docs
!pip install -q opencv-python
IPython’s display module is imported to show images inside the notebook cell itself.
Verbosity controls how much logging output is shown; here it is limited to errors.
from absl import logging  # must be imported before set_verbosity is called
import imageio
from IPython import display

logging.set_verbosity(logging.ERROR)

import os
import random
import re
tensorflow_hub loads the pre-trained model, tempfile creates temporary files and directories, and the ssl module provides access to Transport Layer Security (TLS).
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow_docs.vis import embed
import tempfile
import ssl
import numpy as np
The Abseil (absl) library provides building blocks for Python applications; only its logging module is used here.
from IPython import display
from absl import logging
import imageio
import cv2
Make sure you are running Python 3; otherwise the line below won’t work.
from urllib import request
We are going to use the UCF101 dataset, fetched from the URL below.
UCF_URL = "https://www.crcv.ucf.edu/THUMOS14/UCF101/UCF101/"
VIDEO_LIST = None
The temporary directory, as mentioned before, is created below.
CACHE_DIR = tempfile.mkdtemp()
# SSL context that skips certificate verification; used only for the UCF server.
unverified = ssl._create_unverified_context()
Helper Functions
Listing all the videos available in the UCF101 dataset.
def list_ucf_videos():
    global VIDEO_LIST
    if not VIDEO_LIST:
        # Download the directory index once and extract all .avi file names.
        index = request.urlopen(UCF_URL, context=unverified).read().decode("utf-8")
        videos = re.findall(r"(v_[\w_]+\.avi)", index)
        VIDEO_LIST = sorted(set(videos))
    return list(VIDEO_LIST)
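To see what the regular expression in `list_ucf_videos` picks up without touching the network, the sketch below runs the same pattern on a hand-written fragment of a directory listing. The HTML fragment is invented for illustration; the real index page may look different.

```python
import re

# Made-up fragment of an HTML directory listing (the real page may differ).
sample_index = (
    '<a href="v_ApplyEyeMakeup_g01_c01.avi">v_ApplyEyeMakeup_g01_c01.avi</a>\n'
    '<a href="v_CricketShot_g04_c02.avi">v_CricketShot_g04_c02.avi</a>\n'
    '<a href="readme.txt">readme.txt</a>\n'
)

# Same pattern as in list_ucf_videos, written as a raw string.
matches = re.findall(r"(v_[\w_]+\.avi)", sample_index)

# Each file name appears twice here (href and link text), hence the set().
videos = sorted(set(matches))
print(videos)
```

The `set()` call removes the duplicates and `sorted()` makes the order deterministic, which is why the helper applies both before caching the list.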
Fetching a video and caching it in the local file system.
def fetch(video):
    # Reuse the cached copy if the video was already downloaded.
    path = os.path.join(CACHE_DIR, video)
    if not os.path.exists(path):
        # Build the full URL for a video that is not cached yet.
        url_path = request.urljoin(UCF_URL, video)
        print("Fetching %s => %s" % (url_path, path))
        data = request.urlopen(url_path, context=unverified).read()
        # Write the downloaded bytes to the cache file.
        with open(path, "wb") as f:
            f.write(data)
    return path
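The download-once-then-reuse pattern behind `fetch` can be tried offline. In the sketch below, `fetch_cached` and `fake_download` are hypothetical stand-ins (the latter replaces the real `request.urlopen(...).read()` call), so no network access is needed.

```python
import os
import tempfile


def fetch_cached(name, cache_dir, download):
    """Return the local path for `name`, calling `download(name)` only on a cache miss."""
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        data = download(name)
        with open(path, "wb") as f:
            f.write(data)
    return path


calls = []


def fake_download(name):
    """Stand-in for the real HTTP download; records each call."""
    calls.append(name)
    return b"fake video bytes"


cache_dir = tempfile.mkdtemp()
first = fetch_cached("v_Demo_g01_c01.avi", cache_dir, fake_download)
second = fetch_cached("v_Demo_g01_c01.avi", cache_dir, fake_download)
print(first == second, len(calls))
```

The second call finds the file on disk and skips the download, which is the behaviour that keeps repeated notebook runs fast.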
Opening video files using OpenCV (cv2). First, a helper that crops a frame to its centered square.
def crop_center(frame):
    # Frame height (y) and width (x).
    y, x = frame.shape[0:2]
    min_dimension = min(y, x)
    # Start points of the square along both dimensions.
    starting_x = (x // 2) - (min_dimension // 2)
    starting_y = (y // 2) - (min_dimension // 2)
    # Return the centered square region.
    return frame[starting_y:starting_y + min_dimension,
                 starting_x:starting_x + min_dimension]

Video manipulation and preprocessing, using the function above.
def load(path, max_frames=0, resize=(224, 224)):
    # Open the video file.
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            # Crop to a square, resize, and convert BGR (OpenCV order) to RGB.
            frame = crop_center(frame)
            frame = cv2.resize(frame, resize)
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)
            # max_frames=0 means "read the whole video".
            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    # Divide by 255 to scale pixel values to the range 0-1.
    return np.array(frames) / 255.0

Converting the frames into an animated GIF that is displayed in the notebook.
def gif(images):
    # Scale back to 0-255 and clip before converting to 8-bit integers.
    converted = np.clip(images * 255, 0, 255).astype(np.uint8)
    # Save the GIF at 25 frames per second.
    imageio.mimsave('./animation.gif', converted, fps=25)
    return embed.embed_file('./animation.gif')
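The index arithmetic in `crop_center` can be checked in isolation. The helper below is a hypothetical stand-alone version that returns only the crop bounds, so it needs neither OpenCV nor NumPy. The 320x240 frame size is an assumption about typical UCF101 clips.

```python
def center_crop_bounds(height, width):
    """Return (y0, y1, x0, x1) of the largest centered square in a frame."""
    side = min(height, width)
    x0 = (width // 2) - (side // 2)
    y0 = (height // 2) - (side // 2)
    return y0, y0 + side, x0, x0 + side


# Landscape frame, 240 pixels high and 320 wide: cropped along the horizontal axis.
print(center_crop_bounds(240, 320))
# Portrait frame, 640 pixels high and 480 wide: cropped along the vertical axis.
print(center_crop_bounds(640, 480))
```

In both cases the shorter side is kept in full, so the following resize to 224x224 does not distort the aspect ratio.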
NOTE: the following snippet reads the Kinetics-400 label list from the kinetics-i3d GitHub repository.
# Get the kinetics-400 action labels from the GitHub repository.
KINETICS_URL = "https://raw.githubusercontent.com/deepmind/kinetics-i3d/master/data/label_map.txt"
with request.urlopen(KINETICS_URL) as obj:
    labels = [line.decode("utf-8").strip() for line in obj.readlines()]
print("Found %d labels." % len(labels))
Using the dataset
videos = list_ucf_videos()
# Dictionary mapping each category name to its list of videos.
categories = {}
for video in videos:
    # Strip the "v_" prefix and the "_gXX_cXX.avi" suffix to get the category.
    cat = video[2:-12]
    if cat not in categories:
        categories[cat] = []
    categories[cat].append(video)
print("Found %d videos in %d categories." % (len(videos), len(categories)))
for cat, seq in categories.items():
    # Show the first two file names of each category, separated by commas.
    summary = ", ".join(seq[:2])
    print("%-20s %4d videos (%s, ...)" % (cat, len(seq), summary))
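The slice `video[2:-12]` relies on the UCF101 naming scheme `v_<Category>_gXX_cXX.avi`: it drops the 2-character prefix `v_` and the 12-character suffix `_gXX_cXX.avi`. Here is a small sketch of that grouping step on hard-coded file names, so no download is needed. The helper name `category_of` is made up, and the slice assumes two-digit group and clip numbers.

```python
from collections import defaultdict


def category_of(video_name):
    """Extract the action category from a UCF101-style file name."""
    return video_name[2:-12]


sample_videos = [
    "v_CricketShot_g04_c02.avi",
    "v_CricketShot_g05_c01.avi",
    "v_PlayingGuitar_g01_c03.avi",
]

# defaultdict(list) removes the need for the "if cat not in categories" check.
categories = defaultdict(list)
for video in sample_videos:
    categories[category_of(video)].append(video)

for cat, seq in categories.items():
    print("%-20s %4d videos" % (cat, len(seq)))
```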
Load, Run the model
# Load the model from TensorFlow Hub and keep its default signature.
model = hub.load("https://tfhub.dev/deepmind/i3d-kinetics-400/1").signatures['default']
Let’s have a look at a sample video.
# Fetch a sample video and load its frames.
path = fetch("v_CricketShot_g04_c02.avi")
sample_video = load(path)
Running the model, feeding the input and printing the data.
def pred(sample_video):
    # Model input: add a batch dimension in front of the frames.
    model_input = tf.constant(sample_video, dtype=tf.float32)[tf.newaxis, ...]
    # Logits of each class, then probabilities via softmax.
    logits = model(model_input)['default'][0]
    prob = tf.nn.softmax(logits)
    print("Printing Top 5 actions:")
    for i in np.argsort(prob)[::-1][:5]:
        print(f"  {labels[i]:22}: {prob[i] * 100:5.2f}%")

# Call the function to run the prediction.
pred(sample_video)
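The post-processing inside `pred` is a softmax followed by a top-k selection. The sketch below reproduces that step in plain Python on made-up logits and labels, so it runs without TensorFlow or the model. The numbers are illustrative only and are not actual model output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, labels, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(labels[i], probs[i]) for i in order[:k]]


# Invented logits and labels, for illustration only.
toy_labels = ["playing cricket", "skateboarding", "shooting goal (soccer)", "golf putting"]
toy_logits = [4.0, 1.0, 0.5, 2.0]

toy_probs = softmax(toy_logits)
for label, p in top_k(toy_probs, toy_labels, k=3):
    print(f"  {label:22}: {p * 100:5.2f}%")
```

Subtracting the largest logit before exponentiating does not change the result but avoids overflow for large values.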
The output lists the five most probable actions together with their probabilities.
EndNote
As we can see, the model predicts the correct class with high probability. I highly recommend trying the inflated 3D CNN model on the other datasets mentioned in the introduction and on different videos. Try changing the number of frames or the length of the GIF to see whether it changes the probabilities. You can also try newer models such as MoViNet and Tiny Video Networks.